# Michael Christoph Thrun

# Projection-Based Clustering through Self-Organization and Swarm Intelligence

Combining Cluster Analysis with the Visualization of High-Dimensional Data

Michael Christoph Thrun Marburg, Germany

Philipps-Universität Marburg 2017, university code number (Hochschulkennziffer) 1180

ISBN 978-3-658-20539-3 ISBN 978-3-658-20540-9 (eBook) https://doi.org/10.1007/978-3-658-20540-9

Library of Congress Control Number: 2017963649

#### Springer Vieweg

© The Editor(s) (if applicable) and The Author(s) 2018. This book is an open access publication.

**Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer Vieweg imprint is published by Springer Nature. The registered company is Springer Fachmedien Wiesbaden GmbH. The registered company address is: Abraham-Lincoln-Str. 46, 65189 Wiesbaden, Germany

# **Acknowledgments**

My gratitude goes to my sister Monika Sikora. You have converted with great devotion my often complex phrases and contexts into intelligible words. I thank you for your intuitive understanding of language.

I would like to thank Prof. Dr. Alfred Ultsch for his demanding scientific guidance and continuing education. His leadership and preparatory work have provided me with the tools to meet the demands of the job. Your suggestions have given me the creativity that was often necessary to redirect my research when a solution was not obvious.

Without my student colleagues, Felix Pape and Florian Lerch, some of the ideas sprouted in this work would not have been feasible. I owe you, Felix, because of your selfless voluntary commitment to the realization of manufactured 3D printing models of the U-matrices. The fundamental code for visualizing the U-matrices in the form of a topographic map would not have been possible without your cooperation, for which I thank you, Florian.

The positive working environment, which allowed me to flourish, I largely owe to my colleague Catharina Lippmann. Your collegiality and constructive cooperation created the atmosphere that was continuously aiding my systematic research.



# **Summary**

This work deals with a new approach to the cluster analysis of high-dimensional data. Projection-based clustering combines structures preserved in two dimensions with the underlying high-dimensional structures.

Here, clusters are defined as natural if they are based on high-dimensional data that exhibit discontinuities. Such distance- or density-based discontinuities denote either compact or connected structures. Natural clusters with compact structures are defined mainly by inter- and intracluster distances, whereas connected structures are based on the principle of neighborhoods between data points. Using basic principles founded on graph theory and the investigations carried out in this work, it can be concluded that, when the aim is a visualization or cluster analysis, the optimization of a mathematical objective function can yield misleading results regarding the structure if the underlying structures of the high-dimensional data do not correspond to this objective function.

This work pursues the question of how to identify a correct type of structure that defines clusters in a high-dimensional data set without prior assumptions. It is argued that dimensionality reduction methods can help to solve this problem.

Projection methods are a common approach to the dimensionality reduction of high-dimensional data. They are used to reduce the size of the input space and thereby enable a visualization of the high-dimensional data. However, when the output space is restricted to two dimensions, yielding a scatter plot (projection), low-dimensional similarities do not necessarily represent the high-dimensional distances. The projection can lead to a misleading interpretation of the structures. The quality measures (QMs) used to evaluate the projection have difficulty correctly capturing discontinuities in high-dimensional data because they may be based on false assumptions about the underlying high-dimensional structures. Otherwise, a global objective function could be defined by means of a QM, and it would always be possible to obtain a structure-preserving projection by optimizing this objective function.

The method Databionic swarm (DBS), which consists of three modules, is presented in this work. The first module of the approach proposed here consists of visualizing high-dimensional distances in the two-dimensional projection through a three-dimensional topographic map with hypsometric colors. The resulting topographic map is a further development of the "generalized U-matrix".

In the second module, the new projection method Pswarm is proposed. Pswarm uses the concepts of swarm intelligence, self-organization, symmetry considerations in physics and the Nash equilibrium concept from game theory. Pswarm eliminates the need for a global objective function. Apart from the distance, this projection method requires no input parameters for the projection. Through self-organization, structures of high-dimensional data can be depicted by a process known as emergence. The expectation that a swarm of intelligent agents can be used for visualization and cluster analysis was confirmed. Pswarm was compared with the common projection methods PCA, CCA, t-SNE, ESOM, NeRV and the MDS technique Sammon mapping. For this comparison, a new quality measure (Delaunay classification error, DCE) was used. By using given classifications, the DCE allows an unbiased evaluation of projection quality for both types of structures. The results show that Pswarm projections can outperform projections that result from the optimization of a global objective function.

In the third module, the approaches of previous works are extended by using shortest paths between geodesic distances of the abstract U-matrix of projected points for cluster analysis.

DBS outperforms the common methods of cluster analysis (k-means, PAM, single linkage, spectral clustering, model-based clustering and Ward) in terms of stability and plasticity on an artificial benchmark system of data sets (FCPS). In contrast to other common clustering methods, DBS finds no clusters if no natural clusters exist. The number of clusters can be estimated with the help of a visualization.

The application of DBS to three high-dimensional and multivariate data sets from practice (leukemia, world gross domestic product, Tetragonula bees) reproduced already known insights. In two current applications, hydrology and pain genes, DBS finds plausible and explainable clusters.

Owing to its modularity, DBS can be generalized to projection-based clustering. If prior knowledge is given, the visualization by the generalized U-matrix and the DBS clustering can be applied to any projection method for both types of structures (compact or connected). Alternatively, through the generalized U-matrix visualization, the results of common clustering methods can be verified against the structures found by Pswarm or any other projection method. In addition, 3D prints of the visualized structures of high-dimensional data sets can be produced with common 3D printing techniques.

# **Abstract**

This work introduces a new approach for cluster analysis defined as projection-based clustering. Projection-based clustering combines structures preserved in two dimensions with underlying high-dimensional structures if natural clusters exist in the high-dimensional data. Clusters are defined as natural if they are based on patterns in high-dimensional data characterized by discontinuity. Discontinuous patterns, which can be based on either distance or density, are described in this work as compact or connected structures. Natural clusters with compact structures are defined mainly by inter- versus intracluster distances, whereas connected structures are based on the idea of neighborhoods between data points.

Using basic principles founded on graph theory, this work demonstrates that the objective functions of clustering and visualization are based on the fundamental distinction between connected and compact structures. The derived conclusion is that, when the goal is to achieve a structure-preserving visualization or clustering, the optimization of a mathematical objective function could yield misleading results if the underlying structures of the high-dimensional data do not coincide with the objective function. The question that arises is how to recognize structures that define clusters in a high-dimensional data set without prior knowledge. The argument here is that dimensionality reduction methods may help to solve this problem.

Projections are common dimensionality reduction methods for visualizing high-dimensional data in a two-dimensional space. However, when the output space is restricted to two dimensions, resulting in a two-dimensional scatter plot (projection) of the data, low-dimensional similarities do not necessarily represent high-dimensional distances. This could lead to a misleading interpretation of the underlying structures. Further, it is argued here that the quality measures (QMs) that evaluate such a projection have difficulty correctly capturing discontinuities in high-dimensional data because they imply assumptions about the underlying high-dimensional structures. Otherwise, a global objective function could be defined using the best QM, and it would always be possible to obtain a structure-preserving projection or clustering by optimizing this objective function.

Therefore, the first module of the solution proposed here is to visualize high-dimensional distances in the projection through a three-dimensional topographic map with hypsometric colors, which is a further development of the generalized U-matrix.

After an extensive review of the application of artificial intelligence in data science, two concepts are addressed here: self-organization and swarm intelligence. The irreducible structures of high-dimensional data can emerge through self-organization in a phenomenon called emergence. If properly applied through the use of a swarm of intelligent agents, the data-driven approach presented in this work can outperform the optimization of a global objective function in the tasks of clustering and dimensionality reduction.

Here, the second module, called Pswarm, is presented for projecting high-dimensional data. Pswarm exploits the concepts of swarm intelligence, self-organization, symmetry considerations in physics, and the Nash equilibrium concept from game theory. It eliminates the need for a global objective function and does not require any input parameters for projection besides a distance. The data-driven Pswarm was compared to the common projection methods PCA, CCA, t-SNE, ESOM, NeRV and the MDS technique Sammon mapping. Using the new quality measure (Delaunay classification error), this work shows that the resulting two-dimensional projections of Pswarm are comparable to those of state-of-the-art projection methods such as NeRV and ESOM. By using prior classifications, the Delaunay classification error allows for an unbiased evaluation of projection quality for both types of structures.

For the third module, the author expands the idea of previous works by using shortest paths between geodesic distances of the abstract U-matrix of projected points for cluster analysis. The whole method is called Databionic swarm (DBS), and it outperforms the common clustering methods (k-means, PAM, single linkage, spectral clustering, model-based clustering and Ward) in terms of stability and plasticity on an artificial benchmark system of data sets (FCPS). Contrary to other common clustering methods, DBS finds no clusters if no natural clusters exist. The number of clusters can be estimated with the help of the topographic map.

On three different high-dimensional and multivariate data sets (types of leukemia, world gross domestic product, Tetragonula bees), the already known insights can be reproduced. In two real-world applications, hydrology and pain genes, DBS retrieves meaningful clusters, which was confirmed by domain experts.

Through its modularization, DBS can be generalized to projection-based clustering. The visualization by the generalized U-matrix and the DBS clustering can be applied to every projection method for both types of structures. Through the use of the topographic map, results of common clustering methods can be verified by the structures found by Pswarm or any other projection method. Additionally, 3D prints of the visualized structures of high-dimensional data sets can be manufactured with common 3D printing techniques.

# **1 Introduction**

We live in a time when information is cheaply available and saved as data nearly everywhere. The amount of generated data is growing exponentially. By the end of the year 2016 alone, 9000 exabytes of data will have been generated, equal to 9 trillion gigabytes or the capacity of 360 billion Blu-ray Discs [Schiele, 2016]. The goal of the interdisciplinary field of data science is to extract knowledge from these data with the help of statistics, machine learning or data mining. Unlike a physicist, a data scientist hardly ever starts with a hypothesis and is also not interested in the source of the data or how they were collected. The data must be mined to gain knowledge through the identification of consistent patterns, which is usually a very demanding task.

Among the various available methods of analyzing data, the focal point of this work is cluster analysis. In contrast to common approaches, the goal here is not merely to group similar information but also to explain why the grouping of information in a certain context is valid, nontrivial and useful. Only then will the clustering of data be helpful to a domain expert. Cluster analysis "is a discipline on the intersection of different fields and can be viewed from different angles, which may be sometimes confusing because different perspectives may contradict each other" [Mirkin, 2005, p. 33]. From the statistical perspective, some assumption regarding the underlying model is required, and data clusters are viewed as probability distributions whose properties can be estimated from the data themselves [Mirkin, 2005, pp. 33-34]. "A trouble with this approach is that in most cases clustering is applied to phenomena of which nothing is known" [Mirkin, 2005, p. 34]. Here, cluster analysis is regarded as the process of generating a classification based on empirical data in a situation in which clear theoretical concepts and definitions are absent and the patterns and laws governing the situation are unknown (see [Mirkin, 2005, p. 36]). The concept of every application (available as open-source code in the R language [R Development Core Team, 2008]) used throughout this thesis is based on this idea.

The goal of this work is to provide an open-source framework for cluster analysis that is founded on a swarm-based projection method and uses a human-understandable visualization approach based on a topographic map of high-dimensional data structures, with the option of 3D printing (see [Thrun et al., 2016a]). This framework should be sufficiently stable while remaining adaptive and exhibiting sufficient plasticity to permit the creation of clusters of various shapes. It should include only a very few non-sensitive parameters that can be visually deduced by a non-professional data miner without any need to understand the theory behind them.

To achieve this goal, expertise on various topics from various areas of research will be required. It is the author's experience that experts in different fields rarely share or exchange practical approaches, and almost nobody is both interested in and willing to provide easily available and human-understandable solutions to domain experts.

Here, the main hope is to be able to provide reproducible cluster analysis solutions for nonprofessional data miners and to deliver human-understandable concepts of high-dimensional data structures that are simultaneously able to be processed by machines. In the context of the Databionic swarm (DBS) approach, the author attempts to build, use and explain connections among various fields of research; to be precise, the author will illustrate connections between cluster analysis [Hennig et al., 2015; Jain/Dubes, 1988], the imitation of collective behavior [Beni/Wang, 1993; Bonabeau et al., 1999; Reynolds, 1987], the visualization of information [Venna et al., 2010] and its evaluation, machine learning applications [Herrmann/Ultsch, 2008c], game theory [Nash, 1951], symmetry considerations in physics [Feynman et al., 2007, pp. 147-153, 745] and emergence [Ultsch, 2007]. Undoubtedly, making connections between different schools of thought sometimes requires simplifications. For example, with regard to the collective behavior of bees, the fact that bees have a queen who influences their behavior remains unaddressed in this work. Such simplifications are necessary for analytical modeling and applications of cluster analysis.

Chapter 2 addresses most of the necessary definitions and lays the groundwork for all of the mathematical notation used throughout the thesis. The literature reviewed in chapter 3 shows how common clustering methods tend to implicitly assume the patterns or structures sought in data. The reviewed clustering methods are grouped based on their definitions of generalized neighborhoods.

Chapter 4 introduces and classifies common methods of projecting high-dimensional data into two dimensions. Such projections are necessary to cope with the pitfalls of higher dimensions (see, e.g., [Bouveyron/Brunet-Saumard, 2014, pp. 55-57; Verleysen et al., 2003]). Two- or three-dimensional projections will always result in errors; however, gaining a spatial understanding of more than three dimensions is typically an excessively complex task for humans.

Chapter 5 presents examples to depict the typical errors encountered and describes efforts to manage these errors by means of the U-matrix visualization approach [Ultsch, 2003a]. By contrast, chapter 6 demonstrates a more stringent mathematical approach based on quality measures (QMs) presented in the literature. The evaluation of 19 QMs yields a grouping of the QMs based on their implied characterization of structures of high-dimensional data using the definition of neighborhoods introduced in this thesis. Consequently, it is not possible to generalize any of the QMs. If it were possible, the corresponding optimization approaches would not imply any prior assumptions about the structures of high-dimensional data and, consequently, would outperform any other projection methods.

Chapter 7 discusses a nature-inspired and behavior-based system of data science with the goal of using emergence, instead of the optimization of an objective function, for data visualization and clustering.

Building on the insights gained in chapter 7, chapter 8 introduces the DBS concept. Because it relies on the self-organization of data and emergence, DBS does not imply any particular structure that is sought in data. In the context of the projection, visualization and clustering of artificial or high-dimensional data, chapters 10-12 compare DBS with various common methods and apply the DBS framework both to reproduce known insights and to gain new knowledge about various types of data, e.g., multivariate time series or genetic data.

Readers may skip certain chapters depending on their interests. However, the contents of some chapters are based on insights from previous chapters, as indicated by arrows in Figure 1.1, which outlines the organization of this work. Please note that, owing to technical limitations, the figures and equations are numbered chapter-wise.

Figure 1.1: Dependency graph of the chapters. BBS: behavior based systems; QAV: Quality Assessments of Visualizations; DBS: Databionic swarm. The underlying concept of DBS is based on insights from chapters 3, 5 and 7 (orange). The evaluation of DBS is performed in three steps (green): general validation in chapter 10, the reproduction of known knowledge in chapter 11, and the generation of new knowledge, as validated by domain experts, in chapter 12.


# **2 Fundamentals**

The first section of this chapter familiarizes the reader with the definitions of the basic notation and terminology used in this thesis. Concepts of graph theory are introduced in the next section. They give rise to a new concept of neighborhoods, which is utilized in several chapters. The last section explains a possible approach to knowledge discovery, which is applied in chapters 11 and 12.

# **2.1 Basic Definitions**

# **Hilbert space**

Let $\mathcal{H}$ be a vector space over a field $K$ with the following properties for all elements $x, y, z \in \mathcal{H}$ and $\alpha \in K$:

1.) $\langle \cdot, \cdot \rangle: \mathcal{H} \times \mathcal{H} \to K$ is a non-degenerate symmetric bilinear form:

$$\text{a. } \forall\, x \in \mathcal{H}: \langle x, x \rangle \ge 0$$

$$\text{b. } \langle x, y \rangle_{\mathcal{H}} = 0 \ \ \forall\, y \in \mathcal{H} \ \Rightarrow\ x = 0$$

Thus, $\mathcal{H}$ is a Hilbert space (for further details, see [Bronstein et al., 2005, pp. 635-636; Nolting, 2001, p. 22]).
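For orientation, the conditions usually listed in textbooks for a real Hilbert space can be written out as follows. This is a sketch of the standard definition, not a recovery of the full list given in the cited sources; the norm $\|\cdot\|$ and the sequence $(x_n)$ are symbols introduced here for illustration only.

$$
\begin{aligned}
&\langle x, y \rangle = \langle y, x \rangle, \\
&\langle \alpha x + y, z \rangle = \alpha \langle x, z \rangle + \langle y, z \rangle, \\
&\langle x, x \rangle \ge 0 \ \text{ and } \ \langle x, x \rangle = 0 \Leftrightarrow x = 0, \\
&\|x\| = \sqrt{\langle x, x \rangle}, \qquad \|x_n - x_m\| \to 0 \ \Rightarrow\ \exists\, x \in \mathcal{H}: \ \|x_n - x\| \to 0 .
\end{aligned}
$$

The last line expresses completeness: every Cauchy sequence with respect to the norm induced by the inner product converges to an element of the space.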

# **Bra-ket notation**

Bra-ket notation $\langle \cdot | \cdot \rangle$ is used in physics to describe functions or vectors in a Hilbert space when the coordinate system of the vectors is irrelevant. The left part is called the bra ($\langle \cdot |$), and the right part is the ket ($| \cdot \rangle$). This notation is used to describe physical states (it is also called Dirac notation, as described in [Dirac, 1981, pp. 15-22]; for a formal introduction, see [Nolting, 2001, pp. 147-148]).

# **Operator**

An operator $\hat{A}$ is an unambiguous mapping of each element $|\alpha\rangle$ of the subset $D_\alpha \subseteq \mathcal{H}$ to an element $|\beta\rangle \in W \subseteq \mathcal{H}$ such that $|\beta\rangle = \hat{A}|\alpha\rangle = |\hat{A}\alpha\rangle$, where $D_\alpha$ is the domain of definition of $\hat{A}$ and the set of all $|\beta\rangle$ is the range of $\hat{A}$, as defined in [Nolting, 2001, p. 153]; see also [Bronstein et al., 2005, pp. 49, 639-640]. An "operator is considered to be completely defined when a result of its application to every ket vector [$|\alpha\rangle$] is given" [Dirac, 1981, p. 23].

# **Observation**

An observation *f* is a set of measured values for the properties of a phenomenon. It is described in bra-ket notation as the change from one physical state $\langle y|$ to another physical state $|x\rangle$ that results from the measurement of the operator $\hat{f}$, as denoted by $f = \langle y|\hat{f}|x\rangle$ (see [Feynman et al., 2006, pp. 145, 147]). Such an observation *f* is a measurement of a physical process.

# **Feature**

Each individually measurable property *r* of a phenomenon being observed can be mapped to an operator $\hat{r}$ that can be applied to a physical state $|x\rangle$ [Stöcker et al., 2007, p. 744]. Such an individually measurable property is called a **feature**, **attribute** or **observable**. Here, an approximately continuous distribution of values in the vector space $\mathbb{R}^d$ is additionally assumed for a **variable** (see the definition of the **distribution of a variable**).

# **Data**

A batch of data is defined as a matrix $\langle i|\hat{A}|j\rangle = A_{ij}$, in which **facts**<sup>1</sup> about a physical state are summarized based on observations of the form $\langle y|\hat{A}|x\rangle = \sum_{ij} \langle y|i\rangle \langle i|\hat{A}|j\rangle \langle j|x\rangle$ of a phenomenon in a Hilbert space, where $\langle i|$, $\langle j|$, $|i\rangle$ and $|j\rangle$ are the basic states relevant to the phenomenon (for further discussion, see [Feynman et al., 2006, pp. 147-150]).
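The expansion of an observation over basic states can be checked numerically. The following is a minimal sketch, assuming real-valued two-component states and the unit vectors as basic states; the matrix `A` and the vectors `x` and `y` are made-up illustration values, not data from this work.

```python
def inner(u, v):
    """<u|v> for real-valued state vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def apply(A, x):
    """A|x> for a matrix A stored as a list of rows."""
    return [inner(row, x) for row in A]


# Basic states |0> and |1> (assumption: the standard unit vectors).
basis = [[1.0, 0.0], [0.0, 1.0]]

# Data matrix with entries A_ij = <i|A|j> (illustrative values).
A = [[2.0, 1.0], [0.0, 3.0]]
x = [0.6, 0.8]
y = [1.0, -1.0]

# Left-hand side: the observation <y|A|x> computed directly.
direct = inner(y, apply(A, x))

# Right-hand side: sum over i, j of <y|i> <i|A|j> <j|x>.
expanded = sum(
    inner(y, basis[i]) * A[i][j] * inner(basis[j], x)
    for i in range(2)
    for j in range(2)
)
```

Both `direct` and `expanded` evaluate the same observation; for complex-valued states the bra would additionally require complex conjugation, which this sketch omits.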

## **Distribution of a variable**

A formal distribution $df$ is defined as the probability density of a feature $r$:

݂݀ሺݎሻ ൌ lim → 〈௫,| ௫〉 √ሺሻ [Nolting, 2001, p. 150].

If the feature *r* is continuous, then it is called a **variable** $z \in \mathbb{R}^d$, and *df* is called its probability density function (**pdf**) (see [Goodfellow et al., 2016, p. 58]). Here, when it describes the relative probability that a variable $z$ takes on a given value, such a distribution is a pdf that is assumed to be normalized as follows [Walck, 2007, p. 15]:

$$\int_{-\infty}^{\infty} df(z)\,\mathrm{d}z = 1$$

"Statisticians often use the distribution function or as physicists more often call it the cumulative function which is defined as $cdf(z) = \int_{-\infty}^{z} df(z')\,\mathrm{d}z'$" [Walck, 2007, p. 15].

If not elaborated further, here, the distribution of a variable *z* is regarded as an approximation of its pdf; for further details, see, for example, [Bock, 1974, p. 250; G. Ritter, 2014, p. 275 ff], and for types of pdfs, see [Walck, 2007].
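The normalization condition and the cumulative function can be illustrated numerically. The sketch below assumes a standard normal density as the example pdf and replaces the infinite integration limits with the finite bounds -10 and 10; both choices, as well as the helper names, are assumptions made for illustration.

```python
import math


def df(z):
    """Standard normal pdf, used here only as an example density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def integrate(f, lower, upper, steps=20000):
    """Trapezoidal rule for the integral of f over [lower, upper]."""
    h = (upper - lower) / steps
    total = 0.5 * (f(lower) + f(upper))
    for k in range(1, steps):
        total += f(lower + k * h)
    return total * h


def cdf(z, lower=-10.0):
    """Cumulative function: integral of df from (approximately) -infinity to z."""
    return integrate(df, lower, z)


# Normalization: the pdf integrates to one over (approximately) the real line.
total_probability = integrate(df, -10.0, 10.0)

# For a symmetric density, half of the probability mass lies below zero.
probability_below_zero = cdf(0.0)
```

An empirical distribution estimated from data would be handled the same way, with `df` replaced by the estimated density.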

# **Dirac delta function**

The Dirac delta function $\delta$ is a function with the following properties [Jackson, 1999, p. 31]:

1.) $\delta(z - a) = 0$ for $z \neq a$

2.) $$\int \delta(z - a)\,\mathrm{d}z = \begin{cases} 1, & \text{if } z = a \text{ lies in the region of integration} \\ 0, & \text{otherwise} \end{cases}$$

# **Density of data**

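Let *dn* be the number of observations in an **elementary volume** (see [Bronstein et al., 2005, p. 491]) $\mathrm{d}^{\tilde{d}}v = \mathrm{d}v_1 \cdot \mathrm{d}v_2 \cdots \mathrm{d}v_{\tilde{d}} = \mathrm{d}\vec{v}$ of the Hilbert space $\mathbb{R}^{\tilde{d}}$ (henceforth, $\mathbb{R}^d$); then, the density of the data is defined as $\rho(\vec{v}) = \frac{\mathrm{d}n}{\mathrm{d}\vec{v}}$, where $\rho: \mathbb{R}^d \to \mathbb{R}$ is the density field function. Here, $\rho$ is subject to the condition that *N* is the number of data points defined by

$$N = \int_{\mathbb{R}^d} \rho(\vec{v})\,\mathrm{d}\vec{v} = \int_{\mathbb{R}^d} \sum_{i=1}^{N} \delta(\vec{v} - \vec{v}_i)\,\mathrm{d}\vec{v},$$

in analogy to [Jackson, 1999, p. 33], where $\delta$ is the Dirac delta function and $\rho(\vec{v}) = \sum_{i=1}^{N} q_i\,\delta(\vec{v} - \vec{v}_i)$ is the charge density of point charges. Then, the **homogeneity** of the data is defined as

$$N = \int_{\mathbb{R}^d} \rho(\vec{v})\,\mathrm{d}\vec{v} = \int_{\mathbb{R}^d} \rho_0\,\mathrm{d}\vec{v} = \rho_0 \int_{\mathbb{R}^d} \mathrm{d}\vec{v}, \quad \text{where } \rho_0 = \text{const.}$$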

<sup>1</sup> See [Fayyad et al., 1996, p. 6].
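The definition of the density of data can be made concrete with a discrete analogue in one dimension. The following sketch assumes that the elementary volumes are the cells of a regular grid on a bounded interval and that all points lie inside this interval; the function name, the grid and the sample points are illustrative assumptions.

```python
def density_per_cell(points, lower, upper, cells):
    """Return (densities, cell_width) for 1-D points on a regular grid.

    The density of a cell is the number of observations in the cell (dn)
    divided by the cell width (the elementary volume dv).
    """
    width = (upper - lower) / cells
    counts = [0] * cells
    for v in points:
        # A point equal to the upper bound is assigned to the last cell.
        k = min(int((v - lower) / width), cells - 1)
        counts[k] += 1
    return [c / width for c in counts], width


points = [0.1, 0.2, 0.25, 0.3, 2.6, 2.7, 2.9, 3.4]
rho, dv = density_per_cell(points, 0.0, 4.0, 8)

# Discrete analogue of integrating rho over the space: recovers N.
N = sum(r * dv for r in rho)

# Constant density that homogeneous data with the same N would have.
rho_0 = len(points) / (4.0 - 0.0)
```

Summing the density over all elementary volumes returns the number of data points, while the individual cell densities differ strongly from the constant `rho_0`, which is the situation that the definition of homogeneity excludes.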

#### **Pattern**

A "[p]attern is an expression *E* in a language *L* describing facts [*F*] in a subset $F_E$ of *F*. *E* is called a pattern if it is simpler than the enumeration of all facts in $F_E$" [Fayyad et al., 1996, p. 7]. Here, the expression *E* is "simpler" if it describes a group of similar (see the definitions of **metric space** and **distance** below) or homogeneous observations.

In graph theory, a pattern may be described by a neighborhood *H* (see the graph theory section for details). If the observations are not directly comprehensible, such a pattern is called a *hidden pattern*.

#### **Discontinuity in data**

A set of data can exhibit discontinuity if

$$\int\_{\mathbb{R}^d} \rho(\vec{v}) \mathrm{d}\vec{v} \neq \rho\_0 \int\_{\mathbb{R}^d} \mathrm{d}\vec{v},$$

which means that the density of the data $\rho$ depends on the location $\vec{v}$ in the Hilbert space $\mathbb{R}^d$. Discontinuities can occur when interruptions or distortions exist in the homogeneity of the data, or in the continuity of the distribution of the data, in $\mathbb{R}^d$. Thus, there are elementary volumes $\mathrm{d}\vec{v}$ with high density and elementary volumes $\mathrm{d}\vec{v}$ with low density, or even empty elementary volumes. In the one-dimensional case, such a discontinuity can be mathematically defined as an essential or jump discontinuity. In two or three dimensions, a discontinuity may manifest as a spatial separation (see, e.g., Figure 2.1 or chapters 5 and 9, the Hepta data set).

In a higher-dimensional case, a discontinuity represents a change in the characteristics of facts, resulting in multiple patterns (see, for example, the leukemia data set, chapter 3, Figure 3.7 and chapter 9).
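For the one-dimensional case, the idea of empty elementary volumes between dense ones can be sketched in a few lines. This is a crude heuristic under stated assumptions (a regular grid whose resolution is chosen by hand, points inside the grid bounds); it is an illustration of the definition and not the detection procedure used in this work.

```python
def cell_counts(points, lower, upper, cells):
    """Count the observations in each cell of a regular 1-D grid."""
    width = (upper - lower) / cells
    counts = [0] * cells
    for v in points:
        counts[min(int((v - lower) / width), cells - 1)] += 1
    return counts


def has_gap(counts):
    """True if at least one empty cell lies between two occupied cells."""
    occupied = [k for k, c in enumerate(counts) if c > 0]
    if not occupied:
        return False
    inner = counts[occupied[0]:occupied[-1] + 1]
    return any(c == 0 for c in inner)


# Two spatially separated groups versus evenly spread points.
separated = [0.1, 0.2, 0.3, 0.4, 3.1, 3.2, 3.3, 3.4]
homogeneous = [0.25 + 0.5 * k for k in range(8)]

separated_has_gap = has_gap(cell_counts(separated, 0.0, 4.0, 8))
homogeneous_has_gap = has_gap(cell_counts(homogeneous, 0.0, 4.0, 8))
```

The result depends on the chosen cell width: a grid that is too coarse hides the separation, and a grid that is too fine reports gaps even in evenly spread data.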

Figure 2.1: Spatial separation of data, after [Handl et al., 2005].

#### **Metric space and distance**

Let a metric space be represented by an ordered pair *(M, d)*, where *M* is an arbitrary set and *d* is a metric on *M*, i.e., a function

$$\mathsf{d}: \mathsf{M} \times \mathsf{M} \to \mathbb{R}$$

such that for any *l*, *j*, *m* ∈ *M* ,

$$\begin{aligned} d(l,j) &= d(j,l) \\ d(l,j) &\ge 0 \\ d(l,j) &= 0, \text{iff } l = j \end{aligned}$$

and the triangle inequality is satisfied as follows:

$$d(l, j) + d(j, m) \ge d(l, m)$$

Then, the metric *d* is also called a **distance** (see [Bronstein et al., 2005, pp. 624-625]). By contrast, for a **dissimilarity**, denoted by $\hat{d}$, the triangle inequality may not apply [Bock, 1974, pp. 25-26]. The distance between two **similar** points $l, j \in M$ is small, whereas that between two **dissimilar** points $l, j \in M$ is large. Transformations exist between a dissimilarity $\hat{d}$ and a distance *d* (e.g., [Bock, 1974, pp. 77-79]).

If the distance is defined in an output space *O*, it is denoted by *d(l, j)*, whereas a distance defined in an input space *I* is denoted by *D(l, j)*. An example of a metric space is a Hilbert space that is a real vector space $\mathbb{R}^d$ of *d* dimensions. If the distances in a space are defined as Euclidean distances, then the corresponding space is called a Euclidean space.
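The difference between a distance and a dissimilarity can be checked on a finite set of points. The sketch below tests the metric axioms for the Euclidean distance and for the squared Euclidean distance, which is used here as an example of a dissimilarity that violates the triangle inequality; the function `is_metric` and the sample points are illustrative assumptions, and passing the check on a sample does not prove the axioms in general.

```python
import itertools
import math


def euclidean(l, j):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(l, j)))


def squared_euclidean(l, j):
    """Squared Euclidean distance, a dissimilarity but not a metric."""
    return sum((a - b) ** 2 for a, b in zip(l, j))


def is_metric(points, d, tol=1e-12):
    """Check the metric axioms for d on all pairs and triples of the points."""
    for l, j in itertools.product(points, repeat=2):
        if abs(d(l, j) - d(j, l)) > tol:      # symmetry
            return False
        if d(l, j) < 0:                       # non-negativity
            return False
        if (d(l, j) <= tol) != (l == j):      # d(l, j) = 0 iff l = j
            return False
    for l, j, m in itertools.product(points, repeat=3):
        if d(l, j) + d(j, m) < d(l, m) - tol:  # triangle inequality
            return False
    return True


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
euclidean_is_metric = is_metric(points, euclidean)
squared_is_metric = is_metric(points, squared_euclidean)
```

For the three collinear points, the squared distances are 1, 1 and 4, so the detour via the middle point is shorter than the direct connection and the triangle inequality fails.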

#### **Data set**

A data set consists of a finite set of observations $f \in F \subset \mathcal{H}^{\tilde{d}}$ of $\tilde{d}$ observed features.

In this work, observations $f$ are assumed to be vectors *l* in a metric space *M*, and features are assumed to be variables, if not stated otherwise.

#### **Input space**

An input space $I \subset \mathbb{R}^d$ is the *d*-dimensional space consisting of $d \le \tilde{d}$ variables in a data set that have been selected for a given task and contains *n* data points: $I = \{l_1, \ldots, l_n\},\ n \in \mathbb{N}$. The properties of an input space are as follows (see [Lee/Verleysen, 2007, p. 243]):


#### **Data point**

A data point $l \in I$ is a numeric vector consisting of one observation for each of the *d* variables in the input space, where a vector is an array of numbers arranged in a specific order defined with respect to the *d* variables.

<sup>2</sup> Note that, in general, the number of data points has greatly increased over time [Goodfellow et al., 2016, p. 21, Fig. 1.8], and therefore the precise number may change with time.

#### **Object**

When the data of interest are a set of facts *F* consisting of numerical, ordinal or nominal scaled entries, each fact $f \in F$ such that $f \notin \mathbb{R}^d$ is called an object or **case**.

An object can be regarded as a generalization of a data point. If an object can be interpreted (has a meaning within itself), then it contains **information** ([Ultsch, 2016c]; see also [Ultsch, 1994, p. 2]).

#### **Output space**

An output space $O \subset \mathbb{R}^m$ is the *m*-dimensional space with *m* < *d* in which, for each point $j \in O$, a mapping to a data point *l* of the input space $I \subset \mathbb{R}^d$ exists.

#### **Machine learning**

The field of machine learning concerns computer programs that can imitate learning behavior [Natarajan, 2014] (see also [Goodfellow et al., 2016, p. 99]). Machine learning comes in two general forms<sup>3</sup> (see [Murphy, 2012, p. 2]). *Unsupervised learning* refers to the task of finding patterns in unlabeled data. Since the data are unlabeled, no reward function exists that can be used to evaluate potential results. If the data set is labeled, then *supervised learning* is possible. A typical supervised learning task is classification or regression. A typical unsupervised learning task is cluster analysis.

#### **Label**

A label is a tag $g \in \{1, \ldots, k\} \subset \mathbb{N}$ attached to an object $f \in F$ that identifies the object via a mapping $f: \{1, \ldots, k\} \to F$. The labels of such a set of objects range from *one* to *k* [Hennig et al., 2015, p. 2], where *k* is the number of groups of objects. Here, it is assumed that a label exists for every object.

#### **Classification**

A classification $C = \{G_1, G_2, \ldots\}$ is a system of subsets [Bock, 1974, p. 22] such that $C \subset \mathbb{R}^{\tilde{d}}$. A subset $G_i = \{l_1, \ldots, l_k\}$, $i \in \mathbb{N}$, is a set of *k* observations. In an exclusive classification, the subsets are disjoint, denoted by $G_1 \cap G_2 = \emptyset$; in a non-exclusive classification, elements that overlap between two subsets may exist, denoted by $G_1 \cap G_2 \neq \emptyset$. However, overlapping classification is not considered here (for various types of classification, see Figure 2.2 or [Hennig et al., 2015, p. 45]). Supervised and unsupervised classifications are defined as in the context of machine learning.

<sup>3</sup> Reinforcement learning is not considered in this context; semi-supervised learning (e.g. active learning) uses labeled data as well as unlabeled data.

Figure 2.2: Tree of classification types, after [Jain/Dubes, 1988, p. 56]. This work concentrates on unsupervised classification (see unsupervised machine learning).

#### **Classifier**

A classifier is an algorithm that constructs a function $Cls: F \to \{1, \ldots, k\} \subset \mathbb{N}$ that maps objects $f \in F$ to class labels $g \in \mathbb{N}$.

In terms of understandability, a distinction can be drawn between symbolic and sub-symbolic classifiers [Ultsch/Korus, 1993]. Symbolic classifiers are able to acquire knowledge (for a detailed description, see the last section of this chapter). By contrast, sub-symbolic classifiers (e.g., KNN classifiers) are only able to integrate knowledge [Ultsch, 1994], because a characteristic property of a sub-symbolic representation of data is that a single object alone does not contain information (see [Ultsch, 1994, p. 2]).

#### **Projected point**

A projected point $j(x_1, \ldots, x_m) = \vec{j}$ is a vector of *m* scalars $x_i$ in the output space $O \subset \mathbb{R}^m$, where a vector is an array of numbers arranged in a specific order such that each individual number can be identified by its index.

#### **Projection**

Let $j \in I$ denote data points in the input space $I \subset \mathbb{R}^d$, and let $l \in O$ denote projected points in the output space $O \subset \mathbb{R}^m$. Then, a mapping proj: *I* → *O*, *j* ↦ *l* is called a projection iff $m = \text{const.} \wedge m \ll d$.

Note that unlike for a projection method, for a manifold learning method, the dimensionality of the output space *m* depends on the data set (see, e.g., [Lee/Verleysen, 2007, pp. 14-15]).

#### **2.2 Concepts of Graph Theory Applied to Patterns**

This section uses graph theory to describe patterns found in data.

#### **Graph**

"A graph [Γ] is a pair [Γ = (V, E)] consisting of a finite set V ≠ ∅ and a set E of two-element subsets of V. The elements of V are called vertices. An element e = (a, b) of E is called an edge with end vertices a and b. […] [In such a case,] *a* and *b* are adjacent or neighbors of each other" [Jungnickel, 2013, p. 2].

A graph Γ is called undirected if, for every edge *e(a, b)* in *E*, the edge *e(b, a)* is also in *E*. A graph is called a weighted graph if a number (weight) is assigned to each edge.

#### **Directed graph**

A "directed graph or, for short, a *digraph* is a pair Γ = (V, E) consisting of a finite set *V* and a set *E* of ordered pairs *(a, b),* where a ≠ b are elements of V" [Jungnickel, 2013, pp. 25-26].

#### **Direct adjacency**

Let Γ be a graph, and let *j* be a point in a metric space *M*; then,

$$(j, \Gamma, M) = \{\, l \in M \mid \exists\, v_j, v_l \in V \wedge e(v_j, v_l) \in E \,\}$$

is the set of points that are directly adjacent to j. The direct adjacency is defined by the specified graph.

#### **Adjacency matrix**

A digraph Γ with a vertex set $\{1, \ldots, n\}$ is specified by an $n \times n$ matrix $A = (a_{ij})$, where $a_{ij} = 1$ if and only if $(i, j)$ is an edge of Γ, and $a_{ij} = 0$ otherwise. *A* is called the adjacency matrix of Γ [Jungnickel, 2013, p. 40].
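As an illustration of this definition, the following sketch builds the adjacency matrix of a small digraph from an edge list. The example edges and the helper names `adjacency_matrix` and `symmetrize` are hypothetical; the second helper shows how the undirected case from the definition of a graph corresponds to a symmetric matrix.

```python
def adjacency_matrix(n, edges):
    """Return the n x n adjacency matrix A = (a_ij) of a digraph on vertices 1..n."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i - 1][j - 1] = 1  # a_ij = 1 iff (i, j) is an edge
    return A

def symmetrize(A):
    """Treat the graph as undirected: with every edge (i, j), also include (j, i)."""
    n = len(A)
    return [[1 if A[i][j] or A[j][i] else 0 for j in range(n)] for i in range(n)]

# Hypothetical example digraph on the vertex set {1, 2, 3, 4}.
edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
A = adjacency_matrix(4, edges)
for row in A:
    print(row)
```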

#### **Path**

Let $(e_1, \ldots, e_n)$ be a sequence of edges in a graph Γ. If there exist vertices $v_0, \ldots, v_n$ such that $e_i = v_{i-1}v_i$ for $i = 1, \ldots, n$, then the sequence is called a walk; if $v_0 = v_n$, one speaks of a closed walk (Figure 2.3). A walk for which the $e_i$ are distinct is called a trail (Figure 2.3), and a closed walk with distinct edges is a closed trail. If, in addition, the $v_j$ are distinct, then the trail is a path [Jungnickel, 2013, p. 5].

Figure 2.3: Examples of trails, walks and paths [Jungnickel, 2013, p. 6 Fig. 1.5]: (a, b, c, v, b, c) is a walk but not a trail, and (a, b, c, v, b, u) is a trail but not a path [Jungnickel, 2013, p. 5].

#### **Connected Graph**

Two vertices *a* and *b* of a graph *Γ* are called connected vertices if a walk exists with start vertex *a* and end vertex *b*. If all pairs of vertices of *Γ* are connected, then *Γ* itself is called a connected graph. For any vertex a, we consider *a* to be a trivial walk of length 0, such that any vertex is connected with itself. Thus, connectedness is an equivalence relation on the vertex set of *Γ*. The equivalence classes of this relation are called the connected components of *Γ*. Thus, *Γ* is connected if and only if its vertex set *V* is its unique connected component [Jungnickel, 2013, p. 6].
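Because connectedness is an equivalence relation, the connected components can be found by collecting, for each unvisited vertex, everything reachable from it. The following is a minimal sketch for an undirected graph given as an edge list; the function names and the example graph are illustrative assumptions.

```python
def connected_components(vertices, edges):
    """Partition the vertex set into the equivalence classes of 'is connected to'."""
    adjacent = {v: set() for v in vertices}
    for a, b in edges:
        adjacent[a].add(b)
        adjacent[b].add(a)
    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adjacent[v] - component)
        seen |= component
        components.append(component)
    return components

def is_connected(vertices, edges):
    """A graph is connected iff its vertex set is its unique connected component."""
    return len(connected_components(vertices, edges)) == 1

# Hypothetical example: two components, {a, b, c} and {u, v}.
vertices = ["a", "b", "c", "u", "v"]
edges = [("a", "b"), ("b", "c"), ("u", "v")]
print(connected_components(vertices, edges))
print(is_connected(vertices, edges))
```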

#### **Lattice**

A connected graph Γ with a particular well-defined two-dimensional tiling (tessellation) is defined as a lattice. An $n \times m$ lattice has *n* vertices on the x-axis and *m* vertices on the y-axis. If the tiling is rectangular (every vertex has exactly four perpendicular edges), it is called a **lattice** (tiling) in this work; if the tiling is hexagonal (every vertex has exactly three edges), it is called a **grid** (tiling) in this work.

#### **Shortest path**

For a connected graph Γ, there exists a distance *D(a, b)* between two vertices *a* and *b* that can be defined as the shortest path between these vertices [Jungnickel, 2013, pp. 65-66] as follows: For each path $P = (e_1, \ldots, e_n)$, let the length of *P* be $p(P) := p(e_1) + \cdots + p(e_n)$; then, the distance between two vertices *a* and *b* in *(Γ, p)* is defined by

$$G(a, b, \Gamma) = \begin{cases} \infty, & \text{if } b \text{ is not accessible from } a \\ \min\{p(P) : P \text{ is a path from } a \text{ to } b \text{ in } \Gamma\}, & \text{otherwise} \end{cases}$$

Let the vertices be denoted by points *l*, *j* ∈ *M* in the metric space *M*; then, $G(l, j, \Gamma)$ is the notation if the points *l* and *j* lie in the input space *I*, and $g(l, j, \Gamma)$ is the notation if they lie in the output space *O*.

Note that *d(a, a)* = 0 always holds because an empty sum is considered to have a value of 0, as usual. If no explicit length function is given, then the shortest paths and distances in a graph are defined using a length function that assigns a length of $p(e) = 1$ to each edge *e* [Jungnickel, 2013, p. 66]. An algorithm for calculating the shortest paths in a graph is described in [Jungnickel, 2013, pp. 83-87]. Lee and Verleysen have claimed that graph distances outperform the traditional Euclidean metric in terms of dimensionality reduction [Lee/Verleysen, 2007, p. 227].
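The graph distance defined above can be computed, for example, with Dijkstra's algorithm. The sketch below is one common choice, used here for illustration and not necessarily the algorithm described on the cited pages. The graph representation (a dictionary of neighbor dictionaries holding positive edge lengths) and the name `graph_distance` are assumptions.

```python
import heapq
import math

def graph_distance(graph, a, b):
    """Length of a shortest path from a to b; math.inf if b is not accessible from a.

    graph: dict mapping each vertex to a dict {neighbor: positive edge length p(e)}.
    """
    dist = {a: 0.0}
    heap = [(0.0, a)]
    while heap:
        d_u, u = heapq.heappop(heap)
        if u == b:
            return d_u
        if d_u > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, length in graph.get(u, {}).items():
            d_v = d_u + length
            if d_v < dist.get(v, math.inf):
                dist[v] = d_v
                heapq.heappush(heap, (d_v, v))
    return math.inf

# Hypothetical undirected weighted graph; vertex "z" is isolated.
graph = {
    "a": {"b": 1.0, "c": 4.0},
    "b": {"a": 1.0, "c": 2.0},
    "c": {"a": 4.0, "b": 2.0},
    "z": {},
}
print(graph_distance(graph, "a", "c"))  # path a-b-c is shorter than the direct edge
print(graph_distance(graph, "a", "z"))  # not accessible
```

With all edge lengths set to 1, the same function returns the number of edges on a shortest path, which matches the default length function mentioned above.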

#### **Acyclic graph**

Let $(M, \preceq)$ be a partially ordered set (a poset, for short), which consists of the set *M* together with a reflexive, antisymmetric and transitive relation ≼, and let *M* correspond to a digraph Γ with the vertex set *M* and with edges defined by pairs *(a, b)* such that $a \prec b$; then, because of the transitive property, Γ is acyclic [Jungnickel, 2013, p. 49].

#### **Tree**

A tree is a graph Γ that satisfies the following three conditions [Jungnickel, 2013, pp. 7-8]:


The vertices in a tree are often called nodes. If *(a, b)* is an edge in a tree, then *a* is called the parent of *b*, and *b* is a child of *a*. If a path exists from *a* to *b* (*a* ≠ *b*), then *a* is a proper ancestor of *b* and *b* is a proper descendant of *a* [Safavian/Landgrebe, 1990, p. 2]. If a node has no descendant, it is called a leaf; if a node has no ancestor, it is called a root.

#### **Directed acyclic graph (DAG)**

A DAG is a directed tree (see above) that contains no cycles and one vertex, defined as the root, into which no edges enter. There is a unique path from the root to every vertex [Safavian/Landgrebe, 1990, p. 3]. Every vertex has a descendant called a child, except for the leaf vertices, which do not.

#### **Decision tree**

Let $G_i$ be a subset of a classification $C = \{G_1, \ldots, G_i, \ldots\} \subseteq \mathbb{R}^{\tilde{d}}$; then, a decision tree is a tree with the following properties:


#### **Decision tree learning**

Decision tree learning refers to a type of supervised machine learning in which decision trees are used (see [Safavian/Landgrebe, 1990]).

#### **Binary tree**

A binary tree is an ordered tree such that [Safavian/Landgrebe, 1990, p. 3] (see also the definition of a DAG)


#### **Lemma 1**

Let $\Gamma = (V, E)$ be a connected graph with a positive length function *p*. Then, *(V, D)* is a finite metric space, where the distance function is defined as $D(a, b) = G(a, b, \Gamma)$ [Jungnickel, 2013, p. 68].

#### **Proposition 1**

Any finite metric space can be represented by a pair $(\Gamma, p)$ (network) with a positive length function *p* [Jungnickel, 2013, p. 68].

#### **Ultrametric space**

Note that a metric space can be represented by a tree if and only if the following condition holds for any four vertices *x*, *y*, *z*, and *t* of the given metric space [Jungnickel, 2013, p. 69]:

$$d(x, y) + d(z, t) \le \max\bigl(d(x, z) + d(y, t),\ d(x, t) + d(y, z)\bigr)$$

Changing the triangle inequality to this condition implies an ultrametric space.
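The condition above can be tested on a finite set of points by enumerating all orderings of four distinct points. The sketch below uses two hypothetical examples that are not from the source: distances between the leaves of a star-shaped tree, which satisfy the condition, and shortest-path distances on a cycle with four unit edges, which do not.

```python
import itertools

def satisfies_four_point(points, d, tol=1e-9):
    """True if d(x,y)+d(z,t) <= max(d(x,z)+d(y,t), d(x,t)+d(y,z)) for all distinct x, y, z, t."""
    for x, y, z, t in itertools.permutations(points, 4):
        if d(x, y) + d(z, t) > max(d(x, z) + d(y, t), d(x, t) + d(y, z)) + tol:
            return False
    return True

# Hypothetical star-shaped tree: four leaves attached to one center by edges of these lengths.
leaf_length = {"x": 1.0, "y": 2.0, "z": 3.0, "t": 4.0}

def tree_distance(a, b):
    return 0.0 if a == b else leaf_length[a] + leaf_length[b]

# Hypothetical cycle with four unit edges: shortest-path distance along the cycle.
cycle = ["a", "b", "c", "e"]

def cycle_distance(a, b):
    steps = abs(cycle.index(a) - cycle.index(b))
    return float(min(steps, len(cycle) - steps))

print(satisfies_four_point(list(leaf_length), tree_distance))  # tree metric
print(satisfies_four_point(cycle, cycle_distance))              # not representable by a tree
```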

#### *2.2.1 Patterns Defined as a Generalization of Neighbourhoods*

Here, it is argued that by using shortest paths and direct adjacency, the patterns that exist in data can be generalized to neighborhoods H of an extent *k*.

Let $k \in \mathbb{N}$, *k* > 0, let Γ be a connected graph, let *j* be a point in a metric space *M*, and let $G(j, l, \Gamma)$ be the length of the shortest path between *j* ∈ *M* and an arbitrary point *l* ∈ *M*; then,

$$H_j(k, \Gamma, M) = \{ l \in M \mid G(l, j, \Gamma) \le k \} \tag{1}$$

is the neighborhood set of the point *j*, and *k* is the neighborhood extent. The neighborhood *H* can define a pattern in the input space<sup>4</sup>.
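For the default case in which every edge has length 1, the neighborhood set of equation (1) can be collected with a breadth-first search that stops at depth *k*. This is a minimal sketch under that assumption; the adjacency-list representation and the name `neighborhood` are illustrative. Note that the point *j* itself is a member of the set, since its graph distance to itself is 0.

```python
from collections import deque

def neighborhood(graph, j, k):
    """Return {l : G(l, j, graph) <= k} for unit edge lengths, via breadth-first search."""
    dist = {j: 0}
    queue = deque([j])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond the neighborhood extent k
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

# Hypothetical connected graph: a simple chain a - b - c - d.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(neighborhood(chain, "a", 1))
print(neighborhood(chain, "a", 2))
```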

The easiest example is a neighborhood defined by distances in a Euclidean graph. In the context of graph theory, a Euclidean graph is an undirected weighted graph of the highest order with respect to all other graphs discussed here, because every vertex is connected to every other vertex. Note that the weights of the edges in a Euclidean graph need not necessarily be defined by the Euclidean metric.

Another representation of a neighborhood *H* is a Delaunay graph $\mathcal{D}(V, E)$, which is a subgraph of a Euclidean graph. A Delaunay graph $\mathcal{D}(V, E)$ is based on Voronoi cells [Toussaint, 1980]. Each cell is assigned to one data point, and the size of a cell is characterized in terms of the nearest data points surrounding the point assigned to that cell. Within the borders of one Voronoi cell, there is no position that is nearer to any outer data point than to the data point within the cell. Thus, a neighborhood of data points is defined in terms of direct links between borders of Voronoi cells that induce an edge in the corresponding Delaunay graph [Delaunay, 1934]. In short, a Delaunay graph represents a graph for a neighborhood $H(1, \mathcal{D}, M)$.

A neighborhood *H* can also be represented by a Gabriel graph $G(V, E)$ [Gabriel/Sokal, 1969], which is a subgraph of a Delaunay graph $\mathcal{D}(V, E)$ in which two points are connected if the line segment between the two points is the diameter of a closed disc that contains no other points within it (empty ball condition). A Gabriel graph represents a graph for a neighborhood $H(1, G, M)$. Another case that is often considered is that of a neighborhood $H(knn, K, M)$, where the number of nearest neighbors of a point *j* is defined by the number of vertices connected to this point in the *K*-nearest-neighbor graph (KNN graph), e.g., [Brito et al., 1997]. Here, we will use the shorter notation *H(knn, M)*.
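The empty ball condition and the KNN graph can both be written down directly for small point sets. The following brute-force sketch is for illustration only: the point coordinates and function names are hypothetical, and the Gabriel edges are found by testing every pair against every third point rather than by first building a Delaunay graph. A point *r* lies in the closed disc with diameter *ij* exactly when the squared distances satisfy $D(i, r)^2 + D(j, r)^2 \le D(i, j)^2$.

```python
import itertools

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def gabriel_edges(points):
    """Undirected edges (i, j) whose diametral closed disc contains no third point."""
    edges = set()
    for i, j in itertools.combinations(range(len(points)), 2):
        d_ij = sq_dist(points[i], points[j])
        if all(sq_dist(points[i], points[r]) + sq_dist(points[j], points[r]) > d_ij
               for r in range(len(points)) if r not in (i, j)):
            edges.add((i, j))
    return edges

def knn_edges(points, knn):
    """Directed edges (i, j) where j is among the knn nearest neighbors of i."""
    edges = set()
    for i, p in enumerate(points):
        others = sorted((sq_dist(p, q), j) for j, q in enumerate(points) if j != i)
        edges.update((i, j) for _, j in others[:knn])
    return edges

# Hypothetical two-dimensional points, indexed 0..3.
pts = [(0.0, 0.0), (3.0, 0.0), (1.0, 0.5), (1.0, 3.0)]
print(sorted(gabriel_edges(pts)))
print(sorted(knn_edges(pts, knn=1)))
```

In this example, points 2 and 3 are Gabriel neighbors, yet point 3 is not the nearest neighbor of point 2, so the two graphs induce different neighborhoods on the same data.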

Figure 2.4: Four points and their Voronoi cells: *D(l, k)>D(l, m)* illustrate the different types of neighborhoods: unidirectional versus direction-based.

<sup>4</sup> Such neighborhoods *H* will prove useful for various evaluation steps, as summarized in Fig. 2.5.

Neighborhoods of points can be divided into two types, namely, *unidirectional* and *direction-based* neighborhoods. Consider the four points shown in Figure 2.4. The points *l*, *k*, and *j* are in the same neighborhood $H(1, \mathcal{D}, M)$ in the corresponding Delaunay graph, but the points *l* and *m* are never neighbors in this graph, even if the distance *D(l, m)* is smaller than *D(l, k)*. Thus, in this neighborhood definition, the direction information is more important than the real arrangement of the points in space as characterized by the distances *D*.

However, if a neighborhood is defined in terms of a KNN graph, then the points *l* and *m* could be in the same neighborhood $H(knn, K, M)$, and the points *l* and *k* could be in different neighborhoods, depending on the value of *knn* and on the ranking of the distances between these points. Therefore, this type of neighborhood is called unidirectional. In other words, it can be said that the points *l*, *j*, and *m* are more *dense* with respect to each other than they are with respect to *k*. Thus, unidirectional neighborhoods defined in terms of KNN graphs or unit disk graphs [Clark et al., 1990] can be used to define neighborhoods based on density.

#### **2.3 Overview of Knowledge Discovery**

 *"The term knowledge discovery in databases […] was coined in 1989 to refer to the general process of finding knowledge in data and to emphasize the 'high-level' application of particular data mining methods" [Fayyad et al., 1996, p. 3].* 

In 1996, Fayyad et al. used this term in their introduction to "From Data Mining to Knowledge Discovery" as follows:

*"Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" [Fayyad et al., 1996, p. 6].* 

Dropping the suffix *in databases*, the term *knowledge discovery* was extensively discussed in [Mörchen, 2006, pp. 6-7]. According to the definition used in that work, *knowledge discovery* is "data mining with the goal of finding knowledge, i.e., novel useful, interesting, understandable, and automatically interpretable patterns" [Mörchen, 2006, p. 7]. The definition of *data mining* as given in [Mörchen, 2006, p. 7] is

*"The process of finding hidden information or structure in a data […] [set.] This includes extraction, selection, preprocessing, and transformation of features describing different aspects of the data".* 

The following overview in Figure 2.5 presents a possible approach to knowledge discovery, as applied in chapters 11 and 12. It is not claimed here that this view is the only approach available in this research field. The remainder of this chapter will describe the various tasks involved in knowledge discovery which are shown in Figure 2.5.

Figure 2.5: The step-wise process of knowledge discovery, as inspired by [Fayyad et al., 1996, p. 10; Ultsch, 2000b]. The systematic process may contain loops between any steps [Behnisch/Ultsch, 2015, p. 52]. This work focuses on cluster analysis, which is discussed separately in the next chapter; in general, however, applying machine learning algorithms would be the fourth step.

#### *2.3.1 Feature Selection*

In the first step, the "features must be properly selected so as to encode as much information as possible concerning the task of interest. […] minimum information redundancy among the features is a major goal" [Theodoridis/Koutroumbas, 2009, pp. 596-597] (see also [Lee/Verleysen, 2007, p. 230]). Redundancy refers to a case in which certain features of a data set are not independent of each other [Lee/Verleysen, 2007, pp. 1-2]. For example, if the two variables *l* and *j* are correlated, then $D(l, j) = \sqrt{\sum_i (l_i - j_i)^2}$ is no longer a Euclidean distance [Cormack, 1971, p. 326].

#### *2.3.2 Preprocessing*

 *"Preprocessing the data to be mined is utterly important for a successful outcome of the analysis. If the data is not cleansed and normalized, there is a high danger of getting spurious and meaningless results. Cleansing includes the removal of outliers, i.e., data objects with extreme values, replacement of missing values, or the removal of erroneous corresponding data sets" [Mörchen, 2006, pp. 7-8].* 

Sometimes, this first step is already referred to as feature extraction [Bishop, 2006, p. 2]. Many data mining methods rely on the concept of (dis-)similarity between pieces of information encoded in data. For example, for Euclidean distances, "normalization of the data needs to be considered to avoid undesired emphasis of features with large ranges and variances" [Mörchen, 2006, p. 8] (see also [Jain/Dubes, 1988, p. 38]). The process of creating such "synthetic" data features that retain the most important information of a pattern in question is here called feature extraction (consistent with [Mirkin, 2005, p. 208]).

#### *2.3.3 Feature Extraction*

The first step of feature extraction is to determine the distribution of each individual variable.

 *"Important tools for this inspection are the quantile-quantile plot (QQ-plot) and kernel estimators for the probability density function (pdf). Here we use the PDE method for pdf estimation [Ultsch, 2003b] as it is specially designed to uncover subsets in the variables" [Behnisch/Ultsch, 2015, p. 54].* 

A QQ-plot makes it possible to compare the given distribution of a variable to standard distributions. Additionally, box-whisker diagrams (boxplots) may be used to visualize the quartiles of a variable.

#### *2.3.3.1 Transformations*

*"Real valued data often comes from domains where variables have greatly varying variances because of different scales. Variables with large variances are likely to dominate the obtained distance structure, e.g. when using Minkowski metrics. To overcome this problem, each variable is linearly transformed (standardized) such that the estimated variance is the same on all variables. The Z-score scheme transforms a variable's values* $x \leftarrow (x - m)/\sigma$ *with mean m and standard deviation σ"* [Herrmann, 2011, p. 28].

If a variable can be non-linearly transformed to a normal distribution, the Box-Cox algorithm (see [Asar et al., 2014]) is often used to estimate the factor of the transformation. With an approximation of the factor obtained from the ladder of powers [Tukey, 1977], an "understandable" transformation, e.g., "log" or "sqrt," can be applied that is as near as possible to the factor of the Box-Cox algorithm. "These allow for hypotheses on why the distribution is shaped in a particular way" [Behnisch/Ultsch, 2015, p. 56].

For non-normally distributed variables (e.g., a variable with a multimodal distribution), a meaningful variance $\sigma^2$ may be difficult to estimate. "Instead, a (robust) min/max-standardization transforms a variable's values $x \leftarrow \frac{x - \min(x)}{\max(x) - \min(x)}$ with robust estimates $\min(x)$, $\max(x)$ for minimum and maximum values. There is empirical evidence by Milligan and Cooper [Milligan/Cooper, 1988] that min/max standardization is to be preferred over Z-score, especially if variances of underlying distributions is [sic] hard to estimate" [Herrmann, 2011, p. 28]. In this context, $\max(x)$ and $\min(x)$ are estimated as the 95th and 5th percentiles, respectively, of the distribution [Herrmann, 2011, p. 127].
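The two standardizations quoted in this section can be sketched as follows. Several details are assumptions made for illustration and are not specified in the source: the sample standard deviation is used for the Z-score, percentiles are obtained by linear interpolation between order statistics, and values outside the 5th and 95th percentiles are not clipped, so they fall outside the interval [0, 1].

```python
import statistics

def z_score(x):
    """x <- (x - m) / sigma with sample mean m and sample standard deviation sigma."""
    m, sigma = statistics.mean(x), statistics.stdev(x)
    return [(v - m) / sigma for v in x]

def percentile(x, q):
    """q-th percentile (0..100) by linear interpolation between order statistics."""
    s = sorted(x)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def robust_min_max(x, lower=5, upper=95):
    """x <- (x - min) / (max - min), with min and max estimated as percentiles."""
    lo, hi = percentile(x, lower), percentile(x, upper)
    return [(v - lo) / (hi - lo) for v in x]

# Hypothetical variable with the values 0, 1, ..., 20.
x = [float(v) for v in range(21)]
z = z_score(x)
r = robust_min_max(x)
print(z[10], r[10])  # the median value under both schemes
print(r[0], r[20])   # extreme values fall outside [0, 1] because no clipping is applied
```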

#### *2.3.3.2 Dimensionality Reduction*

A common approach to feature extraction is dimensionality reduction (DR). To cope with the "curse of high dimensionality" (for further details, see [Verleysen et al., 2003]), dimensionality reduction reduces an input space $I \subset \mathbb{R}^d$ to an output space $O \subset \mathbb{R}^m$ such that $m < d$ [Lee/Verleysen, 2007].

*"All difficulties that occur when dealing with high-dimensional data are often referred to as the 'curse of dimensionality'. When data dimensionality grows, the good and well-known properties of the usual 2D or 3D Euclidean spaces make way for strange and annoying phenomena" [Lee/Verleysen, 2007, p. 3].* 

The various phenomena related to this concept are explained in [Lee/Verleysen, 2007, pp. 4-9] (see also [Bellman, 1957]). A DR method is usually either a manifold learning method or a projection method. DR methods such as autoencoders [Hinton/Salakhutdinov, 2006], Isomap [Tenenbaum et al., 2000] or locally linear embedding (LLE) [Roweis/Saul, 2000] that are designed to find a manifold<sup>5</sup> that represents a given set of high-dimensional data<sup>6</sup> are called *manifold learning* methods. Such methods are disregarded here because these manifolds usually have more than two dimensions. DR methods of the type known as projection methods are separately introduced in chapter 4. There, the focus is placed on methods that attempt to visualize information by means of projections that are restricted to visualizing high-dimensional data in a two-dimensional space while preserving their structure (for details, see chapter 5). The quality of a projection critically depends on the concept of dissimilarity that is chosen to be applied to the input space *I*. This concept could be a definition based on either distance or local proximity. An index used to evaluate the quality of a projection is called a quality measure (QM), and 19 QMs are introduced in chapter 6.

<sup>5</sup> "A manifold is a connected region. Mathematically, it is a set of points, associated with a neighborhood around each point. From any given point, the manifold locally appears to be a Euclidean space." [Goodfellow et al., 2016, p. 160]

<sup>6</sup> Often described using the term *intrinsic dimension* (e.g., [Lee/Verleysen, 2007, pp. 18-24, 41, 47ff]).

#### *2.3.4 Cluster Analysis*

Many data mining methods rely on some concept of the dissimilarity between pieces of information encoded in the data of interest. These methods are used for cluster analysis, and common approaches will be described in the next chapter. Cluster analysis is the task of unsupervised classification that results in a clustering. Given a data set *I* that contains *n* data points, the objective of cluster analysis is to group the data points into *K* disjoint subsets of *I*, denoted by $c_1, \ldots, c_K$ [Hennig et al., 2015, p. 2]. "A clustering is […] the partition obtained" with $\mathrm{K} = \{c_1, \ldots, c_K\}$. If a data point *l* belongs to a cluster *c*, then it has the class label $g \in \mathbb{N}$. In the literature, this process is often called hard clustering to distinguish it from methods such as fuzzy clustering, in which a fractional degree of membership is assigned to each $l \in I$ [Jain et al., 1999].

#### **Cluster**

No generally accepted definition of clusters exists in the literature [Hennig et al., 2015, p. 705]. When describing clusters, the term *pattern* is often used (e.g., [Theodoridis/Koutroumbas, 2009]).

Here, consistent with Bouveyron et al., it is assumed that a cluster is a group of similar objects [Bouveyron et al., 2012]. Chapter 3 will elaborate on this statement while presenting the definition of *natural* clusters.

#### **Intracluster Distance**

Let $c_p \subset I$ be a cluster such that $\forall c_q \subset I$, where $p, q \in \{1, \ldots, k\}$ and $p \neq q$, $c_p \cap c_q = \{\,\}$; then, the distance $\mathrm{Intra}(c_p) := D(l, j)$ between two data points $j, l \in c_p$ is called an intracluster distance.

#### **Intercluster Distance**

Let $c_p \subset I$ and $c_q \subset I$ be two clusters such that $p, q \in \{1, \ldots, k\}$, $c_p \cap c_q = \{\,\}$, and $p \neq q$; then, the distance $\mathrm{Inter}(c_p, c_q) = D(j, l)$ between two data points *j* and *l* in the two clusters, $j \in c_p$ and $l \in c_q$, is called an intercluster distance.
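The two definitions above can be made concrete by enumerating all pairwise distances within and between two clusters. This is a minimal sketch, assuming Euclidean distances *D* and two small hypothetical clusters; the function names are illustrative.

```python
import itertools
import math

def D(l, j):
    """Euclidean distance in the input space (requires Python 3.8+ for math.dist)."""
    return math.dist(l, j)

def intracluster_distances(cluster):
    """All distances D(l, j) between pairs of distinct points of one cluster."""
    return [D(l, j) for l, j in itertools.combinations(cluster, 2)]

def intercluster_distances(c_p, c_q):
    """All distances D(j, l) with j in c_p and l in c_q."""
    return [D(j, l) for j in c_p for l in c_q]

# Hypothetical clusters that are clearly separated in the plane.
c_p = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
c_q = [(10.0, 10.0), (10.0, 11.0)]

intra = intracluster_distances(c_p) + intracluster_distances(c_q)
inter = intercluster_distances(c_p, c_q)
print(max(intra))  # largest distance within a cluster
print(min(inter))  # smallest distance between the two clusters
```

When the largest intracluster distance is smaller than the smallest intercluster distance, as in this example, the combined distance distribution has two well-separated groups of values.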

#### **Compact Structures**

Compact structures in a data set are mainly defined by distances *d* if discontinuities in the data exist such that the intracluster distances are small and the intercluster distances are large. Note that the distance distribution is often bimodal if the data structures are compact. This type of structure leads to natural clusters (see chapter 3).

#### **Connected Structures**

Connected structures in a data set are mainly defined by density $\rho(\vec{v})$ if discontinuities in the data exist. If a connected graph Γ is chosen appropriately regarding the data set, these data structures are based on neighborhoods $H(k, \Gamma, M)$. This type of structure leads to natural clusters (see chapter 3).

#### *2.3.5 An Approach to Knowledge Acquisition*

If, for a given data set, there exist labels defined by a clustering or a domain expert, the next step may be to determine what each cluster means [Behnisch/Ultsch, 2015, p. 65] or what kind of knowledge can be acquired from it<sup>7</sup>.

*"Under knowledge we understand a symbolic representation of objects, facts and rules for an interpreter with symbol processing capability, e.g. a human<sup>8</sup>. In particular, knowledge is communicable by word or writing" [Ultsch, 1994, p. 1] (see also [Ultsch, 1987, p. 22]).* 

Knowledge has the properties of being valid, comprehensible, nontrivial, potentially innovative and useful in practice [Behnisch/Ultsch, 2015, p. 52]. It can be stored in a knowledge base, which "is an organized collection of knowledge together with operations for accessing and manipulating knowledge" [Ultsch, 1987, p. 22]. One example of a representation of knowledge is a rule [Ultsch, 2016c], which is defined as a prescription regarding how to generate, interpret and manipulate facts [Ultsch, 1987, p. 22].

In the context of knowledge discovery, knowledge acquisition can be defined "as the encoding of knowledge into the formal representation scheme of a knowledge-based system [KBS]" [Ultsch, 1987, p. 23]; here, a KBS is defined as "a computer program that contains an explicit, formal representation of knowledge in a knowledge base and is capable of [drawing conclusions<sup>9</sup>]" [Ultsch, 1987, p. 23]. In another context, researchers may interview domain experts "to become educated about the domain and to elicit the required knowledge, in a process called knowledge acquisition" [Russell et al., 2003, p. 217]. In short, knowledge acquisition can be described as a process that leads to a formal representation of knowledge (see [Aikins, 1983]), for example, a process leading to the generation of rules required for a computer program, e.g., DENDRAL [Russell et al., 2003, p. 22] or MYCIN [Aikins, 1983]. One possible approach to knowledge acquisition is to use machine learning [Russell et al., 2003, p. 687]. With regard to understandability, the machine learning methods used for this purpose can be classified as either symbolic or sub-symbolic methods [Ultsch/Korus, 1993].

*"Sub-symbolic methods model the structure of data using many numerical parameters. They are usually aimed at prediction or classification. The output of sub-symbolic methods often depends on the values and interactions of most or all model parameters. They fail to explain the prediction or classification. There are certainly areas of data mining where it is sufficient to build such black-box models that can approximately reproduce a classification or predict future data. An important requirement for knowledge discovery is the interpretability of the results. In many domains the expert wants to know why a decision was made or what a […] pattern describes. Comprehensible descriptions of the models are crucial for success in this case"* [Mörchen, 2006, p. 120].

For the acquisition of knowledge through cluster analysis, symbolic methods are preferable, as described in chapters 11 and 12 (see also [Ultsch, 1994]). In chapter 12, decision tree learning is used in a knowledge acquisition approach called Classification And Regression Tree (CART) analysis [Breiman et al., 1984]. This method relies on a binary tree in which the splitting criteria (decisions) for the vertices are expressed in terms of the Gini index (for further details, see [Safavian/Landgrebe, 1990, p. 15]).

"A class is described by a number of conditions" [Ultsch/Korus, 1993, p. 3] that lead to the generation of a subset $G_i \subset C$ defined by a previously identified clustering. Additionally, for each class, a unique class label $g \in \mathbb{N}$ exists for all $l \in G_i$. Every observation $l \in G_i$ can be unambiguously described by one or more properties that are shared among all observations of $G_i$. Here, the conclusion that an observation can be correctly assigned to a class $G_i$ is reached based on the conditions defining a path (rule) from the corresponding leaf to the root of the binary tree, and this conclusion is called the decision to place in $G_i$. Therefore, the class $G_i$ has a semantic characterization because it is characterized by the rules governing the decision tree, which allow this class to be distinguished from other classes. Here, it is assumed that the last step in the evaluation of a clustering is to ask domain experts to validate the identified classes.

<sup>7</sup> In another context one would like to explain a prediction made by a machine learning algorithm.

<sup>8</sup> For humans 7±2 rules appear to be the optimum [Miller 1956].

<sup>9</sup> Formally defined as *inference* in [Ultsch, 1987, p. 22].


# **3 Approaches to Cluster Analysis**

Many data mining methods rely on some concept of the similarity between pieces of information encoded in the data of interest. Various names have been applied to these clustering methods, depending largely on the field of application in data science. For example, in biology the term "numerical taxonomy" is used [Thorel et al., 1990], in psychology the term Q analysis is sometimes employed, market researchers often talk about "segmentation" [Arimond/Elfessi, 2001], and in the artificial intelligence literature, unsupervised pattern recognition is the favored label [Everitt et al., 2001, p. 4]. The corresponding methods can be either data-driven or need-driven. The latter, also called constrained clustering [Tung et al., 2001], aims at organizing the true structure to meet certain application requirements such as energy-aware sensor networks, privacy preservation, and market segmentation [Ge et al., 2007, p. 320]. An overview of constrained clustering algorithms can be found in [Basu et al., 2008].

Here, however, the focus is placed on data-driven<sup>10</sup> methods, in which patterns present in the data are used to identify homogeneous groups of objects [Arabie et al., 1996, p. 8 ff.]. Consequently, the term *cluster analysis* is used to refer to a step in the knowledge discovery process (chapter 2, Figure 2.5). Let it be assumed that in Figure 3.1 (top left), the first data set (I) contains two variables<sup>11</sup>. The division of this homogeneous data set into different patterns would be called dissection [Everitt et al., 2001, p. 7]. By contrast, *natural clusters* do not require dissection; instead, they are clearly separated in the data [Duda et al., 2001, p. 539; Theodoridis/Koutroumbas, 2009, pp. 579, 600], as shown in the second data set (II) in Figure 3.1 (top right).

No generally accepted definition of clusters exists in the literature [Hennig et al., 2015, p. 705]. Additionally, Kleinberg showed for a set of three simple properties (scale-invariance, consistency and richness) that there is no clustering function<sup>12</sup> satisfying all three [Kleinberg, 2003]. By concentrating on distance- and density-based *structures*<sup>13</sup>, this work restricts clusters to "natural" clusters (see section 2) and therefore omits the axiom of richness where all partitions should be achievable. Consequently, only natural clusters, in which objects are similar within clusters and dissimilar between clusters [Bouveyron et al., 2012], are considered here. For example, the distance distribution in the input space can be bimodal, indicating a distinction between the inter- versus intracluster distances: in data set I in Figure 3.1 (bottom left), no large intercluster distances exist and the distribution of the distances is unimodal, whereas in data set II in Figure 3.1 (bottom right), the distribution of the distances is bimodal because data set II contains two natural clusters with a large intercluster distance. Another example is the case in which the number of data points in one *elementary volume* ($d\vec{v}$) of the input space is higher than that in another elementary volume $d\vec{v}$, which can be estimated using a nonparametric technique for density estimation (e.g., kernel density estimation). In a third example, local proximities can be defined as structures based on neighborhoods $H(k, \Gamma, M)$ (see chapter 2.2.1).

M. C. Thrun, *Projection-Based Clustering through Self-Organization and Swarm Intelligence*, https://doi.org/10.1007/978-3-658-20540-9\_3

<sup>10</sup> The progress in an "algorithmic activity" is enforced by data w.r.t. patterns (as opposed to intuition or personal experience, e.g. through the setting of parameters).

<sup>11</sup> In fact, this figure shows a CCA projection of the leukemia data set (see chapter 9).

<sup>12</sup> "[A]ny function f that takes a set S of n points with pairwise distances between them, and returns a partition of S" [Kleinberg, 2003, p. 2].

<sup>13</sup> They can be described as patterns identified based on discontinuity.

Figure 3.1: Data set I is an approximately homogeneous data set with patterns that form no natural clusters (left, top). The distance distribution in this case is not bimodal (left, bottom). Data set II contains two natural clusters with a large intercluster distance (right, top). The distance distribution is bimodal here (right, bottom). See Figure 12.2 or supplement B for a high-dimensional example. Distance distributions were generated using the AdaptGauss CRAN package [Thrun/Ultsch, 2015; Ultsch et al., 2015].

#### **3.1 Common Clustering Methods**

Clustering methods can be broadly divided into two groups: hierarchical and partitional methods [Jain, 2010]. Partitional clustering methods simultaneously divide a set of data points into subsets. Because we are concentrating on *natural clusters,* overlapping clustering is not considered here. It should be remarked that the choice of the clustering algorithm to be used is more important than the choice of the distance calculation [Jain/Dubes, 1988, p. 140].

A prominent example of a partitional clustering method is the well-known *k-means* method of [MacQueen, 1967] (originally from [Steinhaus, 1956]). It proceeds as follows: Once the number of clusters has been chosen, a random initialization of cluster centers, called centroids, is performed in the input space. Then, the nearest data points to each centroid are assigned to that centroid. After the mapping of the data points, the centroids are moved such that the distances from the assigned points to their corresponding centroids are minimized. This process is performed repeatedly. Figure 3.2 illustrates four iterations of the process. In summary, k-means centroids are average points rather than individual data points. Details about the algorithm can be found in [Hennig et al., 2015, p. 68ff].
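The assignment and update steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: the function name `kmeans`, the fixed number of iterations, and the choice to initialize the centroids by sampling data points (instead of arbitrary positions in the input space) are all hypothetical.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Assumption: initialize the centroids with k randomly chosen data points.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: every point is mapped to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: every centroid moves to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

The returned centroids are means of the assigned points and therefore, in general, not data points themselves.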

By contrast, the clustering method called partitioning around medoids (PAM), introduced in [L. Kaufman/Rousseeuw, 1990], minimizes the sum of the distances from the data points within a cluster to one chosen data point in the same cluster, called the medoid [Mirkin, 2005, p. 181]. In other words, the average distance between a medoid and a subset of data points in the same cluster is minimized. Aside from the change from centroids to medoids, the algorithm can be formulated analogously to k-means [Mirkin, 2005, p. 182].
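The difference between a medoid and a centroid can be made concrete with a small sketch. It covers only the selection of the medoid within one given cluster, not the full PAM procedure; the helper names `medoid` and `centroid` and the example data are assumptions made for illustration.

```python
import math


def medoid(cluster):
    """Member of the cluster with the smallest summed distance to all members."""
    return min(cluster, key=lambda cand: sum(math.dist(cand, p) for p in cluster))


def centroid(cluster):
    """Component-wise mean, as used by k-means."""
    return tuple(sum(x) / len(cluster) for x in zip(*cluster))


# Five points around (0.5, 0.5) plus one distant outlier.
cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]
print(medoid(cluster))    # an actual data point of the cluster
print(centroid(cluster))  # an average point, pulled toward the outlier
```

In this example the medoid stays at the central data point, whereas the centroid is shifted toward the outlier.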

Hierarchical clustering algorithms are based on the "representation of data as a hierarchy of clusters nested over set-theoretic inclusion" [Mirkin, 2005, p. 112]. In the agglomerative approach, such an algorithm begins with each data point in its own cluster and successively merges the most similar pairs of clusters to form a cluster hierarchy<sup>14</sup>.

A typical visual representation of this process is called a dendrogram (Figure 3.3). A dendrogram is a tree showing a hierarchical structure of distance-based connections between subsets of points. The similarity between points or groups of points depends on the algorithm. [Bock, 1974] demonstrated (see chapter 2 for details) that for every dendrogram, an ultrametric space can be constructed in which the triangle inequality is redefined as

$$D(l, j) \le \max\big(D(l, m), D(m, j)\big).$$
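The ultrametric form of the triangle inequality can be checked numerically on a small distance matrix. The following sketch is an illustration only; the function name `is_ultrametric` and both example matrices are assumptions and are not taken from the data sets of this work.

```python
from itertools import permutations


def is_ultrametric(dist, tol=1e-12):
    """Check D(l, j) <= max(D(l, m), D(m, j)) for all triples of a square matrix."""
    n = len(dist)
    return all(
        dist[l][j] <= max(dist[l][m], dist[m][j]) + tol
        for l, m, j in permutations(range(n), 3)
    )


# Fusion levels read off a small dendrogram: points 0 and 1 merge at height 1,
# point 2 joins them at height 3.
tree_dist = [[0, 1, 3], [1, 0, 3], [3, 3, 0]]
# Ordinary Euclidean distances of three points at positions 0, 1 and 3 on a line.
line_dist = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]

print(is_ultrametric(tree_dist), is_ultrametric(line_dist))
```

The distances derived from the dendrogram satisfy the stronger inequality, while the Euclidean distances of the three collinear points do not.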

Figure 3.2: Steps of iteration using the k-means algorithm. After a random initialization of three centroids, the nearest data points are assigned to each centroid. Then the centroids are moved to minimize the distances.

<sup>14</sup> The divisive approach is not considered here (see [Mirkin, 2005, p. 113 ff] for details).

Figure 3.3: Dendrogram of the Hepta data set based on the Ward algorithm. Large changes in fusion levels of the ultrametric portion of the Euclidean distance in the Ward algorithm (y-axis) indicate the best cut. Seven clusters are indicated by red boxes at the y-axis value of 10. If only small changes in the fusion levels exist, it indicates that the algorithm is not able to find a cluster structure.

One of the most common hierarchical clustering algorithms is called *single linkage* (SL) [Florek et al., 1951; Sokal/Sneath, 1963], in which the clustering process is agglomerative [Jain et al., 1999]. In SL, the similarity between two subsets of data points is defined as the minimum distance between data points in these subsets [Duda et al., 2001, p. 553].

Let $\tilde{D}$ be the distance between two clusters $c_1 \subset I$ and $c_2 \subset I$, and let $D(l, j)$ be the distance between two data points in the input space $I$; then, SL is defined based on (see [Hennig et al., 2015, p. 9])

$$\tilde{D}(c_1, c_2) = \min_{l \in c_1,\, j \in c_2} D(l, j).$$
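A naive agglomerative procedure based on this cluster distance can be sketched as follows. The Euclidean distance, the function names `single_link_distance` and `single_linkage`, and the stopping rule (a fixed number of clusters) are assumptions chosen for illustration; the sketch favors readability over efficiency.

```python
import math


def single_link_distance(c1, c2):
    """Minimum pairwise distance between the points of two clusters."""
    return min(math.dist(l, j) for l in c1 for j in c2)


def single_linkage(points, n_clusters):
    """Start with singletons and merge the closest pair until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest single-link distance.
        a, b = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a].extend(clusters[b])
        del clusters[b]  # safe because a < b
    return clusters


# A chain of neighboring points and a distant pair.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
clusters = single_linkage(points, 2)
print(clusters)
```

Because only the closest pair of points between two clusters matters, the chain of four neighboring points is merged step by step into one cluster.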

In graph theory terminology, this process generates a tree [Duda et al., 2001, p. 553]. If it is allowed to continue until all subsets of points are linked, the result is a (minimal) spanning tree (MST) [Duda et al., 2001, pp. 553, 554; Jain/Dubes, 1988, p. 70]. Of all common algorithms developed before 1968, only SL satisfies all conditions of a "theoretically valid" clustering (see [Jardine/Sibson, 1968] for details).

Another hierarchical clustering algorithm that will be used here is called the *Ward* algorithm [Ward Jr, 1963]. In the Ward algorithm, the similarity between two subsets of points is based on an optimal value of an objective function, which commonly is the sum of squared errors (*SE*).

Let $c_r \subset I$ and $c_q \subset I$ be two clusters such that $r, q \in \{1, \ldots, k\}$ and $c_r \cap c_q = \{\,\}$ for $r \neq q$, and let the data points in the clusters be denoted by $j_i \in c_q$ and $l_i \in c_r$, with the cardinality of the sets being $k = |c_q|$ and $p = |c_r|$ and with

$$m(c_q) = \frac{1}{k} \sum_{i=1}^{k} j_i \quad \text{and} \quad m(c_r) = \frac{1}{p} \sum_{i=1}^{p} l_i;$$

then, the *SE* is defined as (see [Theodoridis/Koutroumbas, 2009, pp. 661-663])

$$SE = \frac{k \, p}{k + p} \sum_{i=1}^{n} \left( m_i(c_q) - m_i(c_r) \right)^2,$$

where $m_i$ denotes the $i$-th component of the respective centroid and the sum runs over the $n$ dimensions of the input space.
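The merging cost can be evaluated directly from the two centroids. The sketch below assumes Euclidean data given as tuples; the helper names `ward_cost` and `sse` are hypothetical. It also checks the well-known property that this cost equals the increase in the within-cluster sum of squared errors caused by merging the two clusters.

```python
def centroid(cluster):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(x) / len(cluster) for x in zip(*cluster))


def ward_cost(c_q, c_r):
    """k*p/(k+p) times the squared Euclidean distance between the two centroids."""
    k, p = len(c_q), len(c_r)
    m_q, m_r = centroid(c_q), centroid(c_r)
    return k * p / (k + p) * sum((a - b) ** 2 for a, b in zip(m_q, m_r))


def sse(cluster):
    """Sum of squared distances of the points to their own centroid."""
    m = centroid(cluster)
    return sum(sum((x - mi) ** 2 for x, mi in zip(pt, m)) for pt in cluster)


c_q = [(0.0, 0.0), (2.0, 0.0)]
c_r = [(4.0, 3.0), (6.0, 3.0), (5.0, 6.0)]

cost = ward_cost(c_q, c_r)
increase = sse(c_q + c_r) - sse(c_q) - sse(c_r)
print(cost, increase)
```

For this example both quantities evaluate to approximately 38.4, up to floating-point rounding.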

In Figure 3.3, the ultrametric property of the Ward algorithm is represented in a dendrogram (for further details, see [Duda et al., 2001, p. 557; Everitt et al., 2001, p. 68ff; Jain/Dubes, 1988]). If the values on the y axis "for the levels are roughly evenly distributed throughout the range of possible values, then there is no principled argument that any particular number of clusters is better or more natural than another" [Duda et al., 2001, p. 551]. "Large changes in fusion levels are taken to indicate the best cut" [Everitt et al., 2001, p. 76]. The cut depicted in Figure 3.3 generates a clustering consisting of seven clusters of roughly equal size. The next clustering method used in this work is called spectral clustering.

*"[It] is a class of graph-based techniques that unravel the structure properties of a graph using information conveyed by the spectral decomposition [eigendecomposition [see [Goodfellow et al., 2016, pp. 42-44]]] of an associated [Laplacian] matrix. The elements of this matrix code the underlying similarities among nodes [data points] of the graph" [Theodoridis/Koutroumbas, 2009, p. 772].* 

 *"The K principal eigenvectors of the Laplacian matrix provide a mapping of the objects into K dimensions. To obtain clusters, the resulting K-dimensional vectors are clustered by standard methods, usually K-means. There are various interpretations of this. […]. For these [Euclidean] data, spectral clustering acts as a remarkably robust linkage method." [Hennig et al., 2015, p. 10].* 

There is a close resemblance between spectral clustering and manifold learning methods [Theodoridis/Koutroumbas, 2009, p. 779]. Here, the clustering algorithm of [Ng et al., 2002] is used to take advantage of the open-source implementation of this method that is available in the R language [R Development Core Team, 2008].

"Clustering via mixtures of parametric probability models is sometimes in the literature referred to as 'model-based clustering'" [Hennig et al., 2015, p. 10]. With the clustering algorithm of [Fraley/Raftery, 2006] in mind, here, this clustering method is called the *mixture of Gaussians* (MoG) method. The MoG method uses the *expectation maximization* (EM) algorithm (for further details on the EM algorithm, see [Bishop, 2006]).

*The EM algorithm is "an algorithm of alternating maximization applied to the likelihood function for a mixture of distributions model. At each iteration, EM is performed according to the following steps: (1) Expectation: Given parameters of the mixture* $P$ *and individual density functions* $a$*, find posterior probabilities for observations to belong to individual clusters* $g$ *[…]. (2) Maximization: given posterior probabilities* $g$*, find parameters* $P$*,* $a$ *maximizing the likelihood function" [Mirkin, 2005, p. 178].*
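The alternation of the two steps can be illustrated for a one-dimensional mixture of Gaussians. This is a minimal sketch under simplifying assumptions (univariate data, fixed number of iterations, manually chosen starting values, a small variance floor); it is not the procedure implemented in the mclust package, and all names are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, iterations=50):
    """EM for a one-dimensional Gaussian mixture (updates the parameter lists in place)."""
    k = len(means)
    for _ in range(iterations):
        # Expectation: posterior probability g[i][c] that observation i belongs to component c.
        g = []
        for x in data:
            joint = [weights[c] * normal_pdf(x, means[c], variances[c]) for c in range(k)]
            total = sum(joint)
            g.append([v / total for v in joint])
        # Maximization: re-estimate weights, means and variances from the posteriors.
        for c in range(k):
            n_c = sum(row[c] for row in g)
            weights[c] = n_c / len(data)
            means[c] = sum(row[c] * x for row, x in zip(g, data)) / n_c
            variances[c] = max(
                sum(row[c] * (x - means[c]) ** 2 for row, x in zip(g, data)) / n_c,
                1e-6,  # assumption: variance floor against degenerate components
            )
    return means, variances, weights, g


data = [0.8, 1.0, 1.2, 0.9, 1.1, 4.9, 5.0, 5.2, 5.1, 4.8]
means, variances, weights, g = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print(means, weights)
```

For these two well-separated groups, the estimated means move close to the group averages, and the posterior probabilities become nearly crisp cluster assignments.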

The MoG method suffers "from the well-known curse of dimensionality [Bellman, 1957], which is mainly due to the fact that model-based clustering methods are over-parametrized in high-dimensional spaces" [Bouveyron/Brunet-Saumard, 2014, p. 53]. To solve this problem, "for model based clustering, variable selection can be tackled within a Bayesian framework" [Bouveyron et al., 2012]. In the case of the MoG clustering method, the optimal model can be calculated according to the Bayesian information criterion [Aho et al., 2014] for parameterized Gaussian mixtures that are EM initialized using hierarchical agglomeration [Fraley/Raftery, 2002, pp. 10-12].

*"In each hierarchical agglomeration, each stage of merging corresponds to a unique number of clusters, and a unique partition of data. A given partition can be transformed into indicator variables […] which can then be used as conditional probabilities in an M-step of EM for parameter estimation, initializing an EM iteration" [Fraley/Raftery, 2002, p. 11]. Here, the R package mclust is used [Fraley/Raftery, 2006].* 

# **3.2 Structure of Natural Clusters**

*"Clusters can be of arbitrary shapes (structures) and sizes in a multidimensional pattern space. Each clustering criterion imposes a certain structure on the data, and if the data happen to conform to the requirements of a particular criterion, the true clusters are recovered. Only a small number of independent clustering criteria can be understood both mathematically and intuitively. Thus the hundreds of criterion functions proposed in the literature are related and the same criterion appears in several disguises"* [Jain/Dubes, 1988, p. 91].

This section analyzes common clustering algorithms from the perspective of structures, whereas in various other sources, the clustering criterion or objective function has been understood only intuitively. Here, it is argued that the main argument of Jain and Dubes is broadly accepted by the clustering community: different clustering methods tend to implicitly assume different structures of clusters [Duda et al., 2001, pp. 537, 542, 551; Everitt et al., 2001, pp. 61, 177; Handl et al., 2005; Theodoridis/Koutroumbas, 2009, pp. 862, 896; Ultsch/Lötsch, 2016].

# *3.2.1 Types of Structures Sought by Clustering Algorithms*

The argument of Handl et al. is partially adopted here, in which natural clusters are considered to exhibit two types of structures, called compact and connected structures [Handl et al., 2005], as depicted in Figure 3.4. Clusters with compact structures show small variations in their intracluster distances; connected structures are based on the idea of neighborhoods of data points [Handl et al., 2005]. Here, a compact structure is considered to be mainly defined by inter- versus intracluster distances, whereas a connected structure is primarily defined by neighborhoods *H* of data. Using the definitions presented in section 2.2.1, neighborhoods can be identified based on graph theory. This can result in connected structures consisting of either unidirectional or direction-based neighborhoods.

Figure 3.4: Two types of cluster structures, compact (left) and connected (right), taken from [Handl et al., 2005]. Here, a compact structure is considered to be mainly defined by intra- versus intercluster distances, whereas a connected structure is primarily defined based on neighborhoods $H(k, \Gamma, M)$ and the density of the data.

An example of an algorithm that seeks compact clusters is the k-means clustering algorithm, which imposes a spherical cluster structure [Duda et al., 2001, p. 542; Handl et al., 2005, p. 3202; Hennig et al., 2015, p. 61; Mirkin, 2005, p. 108; Theodoridis/Koutroumbas, 2009, p. 742] such that the clusters cannot be too elongated [L. R. Kaufman/Rousseeuw, 2005, p. 117]. This cluster structure can be found in a data set if "the data points are actually normally distributed" (…) because "the sample mean tends to fall in the region where the samples are most densely concentrated" [Duda et al., 2001, p. 537]. The k-means algorithm is sensitive to noise and outliers [Theodoridis/Koutroumbas, 2009, p. 744]. "This drawback […] gave rise to the k-medoids algorithms […]." The PAM algorithm is less sensitive to outliers. Because of its strong similarity to the k-means algorithm, it is assumed here that PAM also yields a compact spherical cluster structure.

Examples of algorithms that seek connected clusters include density-based methods such as DBscan [Ester et al., 1996] and SL [Handl et al., 2005]. Because SL searches for nearest neighbors [Cormack, 1971, p. 331], it tends to produce connected and chain-like structures [Duda et al., 2001, p. 554; Everitt et al., 2001, p. 67; Hartigan, 1981; Jain/Dubes, 1988, pp. 64-65; Theodoridis/Koutroumbas, 2009, p. 660]. A nearest neighbor is also a Delaunay neighbor (Figure 3.4), leading to a direction-based connected structure of clusters. Spectral clustering is based on graph theory and consequently searches for connected structures [Ng et al., 2002, p. 5] of clusters with "chain-like or other intricate structures" [Duda et al., 2001, p. 582]. This indicates that such an algorithm also searches for direction-based connected clusters (see also [Hennig et al., 2015, p. 10]). "They [spectral clustering methods] are well-suited for the detection of arbitrarily shaped clusters, but can lack robustness when there is little spatial separation between the clusters" [Handl et al., 2005, p. 3202].

The Ward algorithm is sensitive to outliers and tends to find compact clusters of equal size [Everitt et al., 2001, p. 61, Tab. 1] that are ellipsoidal in structure [Ultsch/Lötsch, 2016]. The MoG method uses a mixture-of-distributions approach, which leads to connected clusters. Contrary to [Handl et al., 2005], it is argued here that the MoG method should be able to separate clusters that are not linearly separable (e.g., Chainlink [Ultsch/Vetter, 1995]). Jain and Dubes report that "fitting a mixture density model to patterns" creates clusters with hyper-ellipsoidal shapes [Jain/Dubes, 1988, p. 92]. Handl et al. report that the MoG method is very effective for well-separated clusters [Handl et al., 2005, p. 3202].

In the case of self-organizing mapping (SOM)<sup>15</sup>, the structures have been reported to be of "very general shapes" [Duda et al., 2001, p. 582; Ultsch/Lötsch, 2016]. Similarly to the emergent SOM (ESOM)/U-matrix clustering method [Ultsch et al., 2016a], the Databionic swarm (DBS) method that is discussed later in this work also uses the concept of emergence<sup>16</sup>, through which novel properties can arise in a system. Emergence leads to clusters whose structures are not predefined.

To summarize, the cluster structures that are theoretically sought by various methods are visualized in Figure 3.5. It should be noted that clustering methods that search for clusters with connected structures should also be able to find compact clusters as long as the distance between clusters is large or the density between clusters is very low (see also [Handl et al., 2005, p. 3202]); e.g., "single-linkage clusters detect high-density clusters if there is a low enough valley separating them" [Hartigan, 1981]. However, methods that search for compact and spherical structures cannot be expected to find connected structures.

<sup>15</sup> However, for k-means-SOM of the batch type, spherical or well-separated structures have been reported [Handl et al., 2005, p. 3202] (see the SOM section in chapter 4 for the differences between ESOM and k-means-SOM).

<sup>16</sup> Definition, see chapter 7.3, p. 81-82

Figure 3.5: Overview of the cluster structures that common clustering algorithms tend to find. It is based on the literature, except for the MoG algorithm<sup>17</sup>, for which an educated guess is made. The subgroup of DBscan clustering is characterized based on arguments presented in section 3.2.1, for the definition of emergent see chapter 7.3.

#### *3.2.2 Quality of Clustering*

*"[The quality of clustering is measured using a] procedure for validating a cluster structure […]. This can be based on an internal index, an external index or resampling. An internal index scores the degree of correspondence between the data and the cluster structure. An external index compares the cluster structure with a structure given externally. A resampling is used to see whether the cluster structure is stable with respect to data change" [Mirkin, 2005, p. 205] (see also [Jain/Dubes, 1988, p. 161ff]).*

Internal and external indices are also often called *intrinsic* and *extrinsic* indices, respectively; here, they are referred to as *unsupervised* and *supervised* indices, respectively. The simplest example of a supervised index is the accuracy, which is defined as follows:

$$\text{Accuracy} \left[ \% \right] = \frac{\text{No. of true positives}}{\text{No. of cases}} \tag{3.1}$$

In Eq. 3.1, the number of true positives is the number of labeled data points for which the label defined by a prior classification is identical to the label defined after the clustering process. To determine either the number of clusters or the clustering quality, two approaches are generally possible. Covariance matrices can be calculated, or the intra- versus intercluster distances can be compared to evaluate the homogeneity versus heterogeneity of the clusters. In the literature, a sufficient overview of 15-30 indices has already been provided [Charrad et al., 2012; Dimitriadou et al., 2002], and these indices will not be further discussed here. A special type of unsupervised indices, referred to as quality measures for projection methods, will be separately introduced in chapter 6. Two unsupervised indices and corresponding visualizations are presented in the following sections.

<sup>17</sup> Also known as model-based clustering.
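The accuracy of Eq. 3.1 can be computed as sketched below. Because the labels produced by a clustering algorithm are arbitrary, the sketch assumes that the cluster labels are first matched one-to-one to the classes such that the number of true positives is maximal; this matching by exhaustive search, the function name `clustering_accuracy`, the factor of 100 for the percentage, and the toy labels are assumptions made for illustration.

```python
from itertools import permutations


def clustering_accuracy(prior, clustering):
    """Accuracy in percent under the best one-to-one relabeling of the clusters."""
    classes = sorted(set(prior))
    clusters = sorted(set(clustering))
    if len(clusters) > len(classes):
        raise ValueError("sketch assumes no more clusters than classes")
    best = 0
    # Exhaustive search over all assignments of classes to clusters (small k only).
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == p for p, c in zip(prior, clustering))
        best = max(best, hits)
    return 100.0 * best / len(prior)


prior = ["APL", "APL", "healthy", "healthy", "healthy", "CLL", "CLL", "AML", "AML", "AML"]
clustering = [2, 2, 1, 1, 3, 3, 3, 4, 4, 4]
print(clustering_accuracy(prior, clustering))
```

In this toy example, nine of the ten labels agree after the relabeling, which corresponds to an accuracy of 90%.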

#### *3.2.2.1 Heatmaps*

A heatmap is an example of an unsupervised index. For the ordering of the data points in heatmaps, dendrograms are often used. They enable the visualization of high-dimensional information and dissimilarity matrices without projecting them into a lower-dimensional space. Their use strongly depends on the sequence of the observations. For cluster validation, it is desirable to plot observations that are in the same cluster together [Hennig et al., 2015].

*"[A heatmap] consists of a rectangular tiling, with each tile shaded on a color scale to represent the value of the corresponding element of the data set. The rows (columns) of the tiling are ordered such that similar rows (columns) [in the sense that they are in the same cluster] are near each other" [Wilkinson/Friendly, 2012]. "The cluster heat map is a rectangular tiling of a data matrix with cluster trees appended to its margins. Within a relatively compact display area, it facilitates inspection of joint cluster structure" [Wilkinson/Friendly, 2009].* 

Unlike in [Wilkinson/Friendly, 2009; Fig. 1], in Figure 3.7, the dendrogram between the variables is disregarded and only the $n \times n$ heat map of the distance matrix is shown.

#### *3.2.2.2 Silhouette plots*

The Silhouette plot is a common unsupervised index for visual evaluation of a clustering [L. R. Kaufman/Rousseeuw, 2005].

*"A score function* $s: X \to [-1, 1]$ *evaluates the positioning of data objects inside their assigned cluster. Let a(x) denote the average distance between x and all other objects of the same cluster, and b(x) denotes the smallest average distance between x and all objects of another cluster. The silhouette score follows as* $s(x) = \frac{b(x) - a(x)}{\max\{a(x), b(x)\}}$*. Silhouette scores similar to 1 indicate objects that have been assigned to an appropriate cluster, whereas −1 indicates objects that have been badly classified. Silhouette scores similar to 0 indicate objects that lie in between clusters. Each cluster is represented by one silhouette, showing which objects lie within the cluster and which objects merely hold an intermediate position. The entire clustering is displayed by plotting all silhouettes into a single diagram, from which the quality of the clusters can be compared" [Herrmann, 2011, pp. 91-92].*

A reasonable clustering is characterized by a silhouette width of greater than 0.5, and an average width below 0.2 should be interpreted as indicating a lack of any substantial cluster structure [Everitt et al., 2001, p. 105]. However, it is evident that silhouette scores assume clusters that are spherical or Gaussian in shape [Herrmann, 2011, pp. 91-92].
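The silhouette score of each data point can be computed directly from its definition. The following sketch assumes Euclidean distances, at least two clusters, and a score of 0 for singleton clusters; the function name `silhouette_scores` and the example data are hypothetical.

```python
import math


def silhouette_scores(points, labels):
    """Silhouette score s(x) = (b(x) - a(x)) / max(a(x), b(x)) for every point."""
    scores = []
    for i, (x, lab) in enumerate(zip(points, labels)):
        # a(x): average distance to all other objects of the same cluster.
        own = [
            math.dist(x, y)
            for j, (y, other_lab) in enumerate(zip(points, labels))
            if other_lab == lab and j != i
        ]
        if not own:
            scores.append(0.0)  # assumption: convention for singleton clusters
            continue
        a = sum(own) / len(own)
        # b(x): smallest average distance to all objects of another cluster.
        b = min(
            sum(math.dist(x, y) for y, other_lab in zip(points, labels) if other_lab == other)
            / labels.count(other)
            for other in set(labels)
            if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_scores(points, [0, 0, 1, 1])
bad = silhouette_scores(points, [0, 1, 0, 1])
print(sum(good) / len(good), sum(bad) / len(bad))
```

For the clustering that follows the two groups, the average silhouette width is clearly above 0.5, whereas the clustering that mixes the groups yields negative scores.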

#### **3.3 Problems with Clustering Methods**

Several problems encountered when using common clustering methods can be illustrated with genetic data that a domain expert measured for subjects who were known either to be healthy or to have one of three subtypes of leukemia. Here, a typical knowledge discovery task could be to identify patterns in the cancer subtypes based on the four diagnoses leading to the prior classification.

*"[I]t is a common practice among researchers to employ a variety of different clustering techniques to analyse a dataset, and to use visual inspection<sup>18</sup> and prior biological knowledge to select what is considered the most 'appropriate' result" [Handl et al., 2005, pp. 3202-3203].*

 Consequently, the first step would be to confirm that the structure defined by the classification distinguishing the healthy patients from the non-healthy ones does indeed exist in this data set.

<sup>18</sup> The application of visual inspection will be reported in chapter 6, Fig. 1, resulting in arbitrary projections.

The data set used as an example to illustrate the general problem described above contains data representing 7747 variables for 554 subjects (see chapter 9 for details). Of the subjects, 109 are healthy, 15 have acute promyelocytic leukemia (APL), 266 have chronic lymphocytic leukemia (CLL), and 164 have acute myeloid leukemia (AML). There is a possibility that some subjects might be misclassified, but a future publication will address this diagnostic.

The heatmap and the silhouette plot presented in Figures 3.7 and 3.6, respectively, show that this data set is defined by discontinuities because the intracluster distances are small and the intercluster distances are large. Hence, the leukemia data set is a high-dimensional data set with natural clusters that are specified by the illness status and defined by discontinuities<sup>19</sup>.

Table 3.1 shows the accuracies of common clustering algorithms computed by comparing the clustering results with the prior classification made available by the domain expert. The default settings were used for all algorithms, and the number of clusters was assumed to be four. The MoG algorithm cannot be applied without first using dimensionality reduction methods because the dimensionality of the data set is too high. Only one algorithm (Ward) is able to fully reproduce the prior classification. However, a classification should typically be reproduced using more than one algorithm, and the reproduction of a classification with 100% accuracy is unusual.

This example illustrates that "Clustering algorithms will create clusters whether the data are naturally clustered or purely random" [Jain/Dubes, 1988, p. 201] and "By imposing a predefined shape on the clusters, classical algorithms occasionally suggest a cluster structure in homogenously distributed data or assign points to incorrect clusters" [Ultsch/Lötsch, 2016].

To summarize, the unsupervised indices, namely, the heatmap and the silhouette plot, agree with the prior classification provided by the domain expert, whereas the external index of accuracy and the projections of the data5 disagree with the domain expert. The question arises whether this data set contains natural clusters and, if so, how the structure of these natural clusters can be correctly identified or how the optimal clustering (or projection) algorithm can be chosen for the knowledge discovery task. This work will propose approaches and solutions to these problems.

Figure 3.6: Silhouette plot of the leukemia data set indicates a cluster structure.

<sup>19</sup> It should be remarked that common data-driven methods as well as the heatmap and Silhouette plot do not reproduce the (sub) classification(s) of AML (like FAB subtypes) or CLL of research in this area, e.g. [Bene et al., 1995; Bennett et al., 1985; Vardiman et al., 2009; Haferlach et al., 2010], for CLL [Rosenwald et al., 2001].

Table 3.1: Accuracy results for common clustering algorithms. No result could be calculated for the MoG algorithm (also known as model-based clustering).


Figure 3.7: The heatmap of the leukemia data set with at least one outlier (red line). The intracluster distances are distinctively smaller than the intercluster distances. Cls1 = APL, Cls2 = healthy, Cls3 = CLL, Cls4 = AML.


# **4 Methods of Projection**

Dimensionality reduction techniques reduce the dimensions of the input space to facilitate the exploration of structures in high-dimensional data. Two general dimensionality reduction approaches exist: manifold learning and projection. Manifold-learning methods attempt to find a sub-space in which the high-dimensional distances can be preserved. These sub-spaces may have a dimensionality of greater than two. However, only two- or three-dimensional representations of high-dimensional data are easily graspable for the human observer.

The goal of this chapter is the visualization of structures in high-dimensional data. Venna et al. argued that "manifold learning methods are not necessarily good for […] visualization […] since they have been designed to find a manifold, not compress it into a lower dimensionality" [Venna et al., 2010, p. 452], and it has been shown by van der Maaten et al. that they do not outperform classical principal component analysis (PCA) for real-world tasks [L. J. van der Maaten et al., 2009].

Therefore, this chapter focuses on common projection methods. Many projection methods are characterized by an objective function that is optimized using gradient descent or a corresponding learning algorithm. The quality of the projection and, consequently, of the visualization will critically depend on the similarity concept chosen as the basis of the objective function, which may be based on either distance or local proximity; thus, the methods will be categorized on this basis. This chapter will attempt to relate the various projection approaches to the compact and connected structure types introduced in the previous chapter.

# **4.1 Common Approaches**

Here, projection is used as a method for visualizing high-dimensional data in a two-dimensional space such that the discontinuities in the data are captured. Thus, the quality of a projection critically depends on the chosen similarity concept. This concept may be defined based on either distance or local proximity. The former type of similarity describes the arrangement of all given points in space and is sometimes called topography; the latter compares local neighborhoods and is sometimes called topology. Here, projections are called focusing if they are constructed using an iterative learning process that first adapts to the global intercluster distances and then focuses on more local intracluster distances.

# *4.1.1 Principal Component Analysis (PCA)*

PCA assumes that the directions in the input space that show the highest variance contain the most information about the data set [Hotelling, 1933]. The coordinate system of the input space is replaced with a (principal) coordinate system in which the variance of the data is maximized. This is achieved by finding a set of weighted linear combinations of the original variables, where the weights are found through eigendecomposition (for a definition, see [Goodfellow et al., 2016, pp. 42-44]).

Pearson proposed an equivalent definition based on an objective function in which the average projection cost is minimized [Pearson, 1901]. The projection cost is defined in terms of the mean squared distances between the points $l \in I$ and the projected points $j \in O$:

$$E = \frac{1}{n} \sum\_{l \in I}^{n} D(l, \hat{\jmath}) \qquad \qquad \text{(4.1)}$$

where $\hat{\jmath} = j + \sum\_{i=m+1}^{n} b\_i \* \hat{u}\_i = \sum\_{i=1}^{m} b\_i \* u\_i + \sum\_{i=m+1}^{n} b\_i \* \hat{u}\_i$ has the same dimension as $l \in I$. Here, *n* is the dimension of the input space *I*, *m* is the dimension of the output space *O*, the $u\_i$ are the basis vectors, and the $b\_i$ are constants. The minimization of *E* is achieved by choosing the basis vectors to be eigenvectors of the covariance matrix constrained by the orthonormality conditions [Duda et al., 2001, pp. 114-117]:

$$Cov(l, j) = \frac{1}{n} \sum\_{l \in I} (l - mean\_l) \left( j - mean\_l \right) \tag{4.2}$$

$$Cov(l, j) \* u\_l = \lambda\_l u\_l \tag{4.3}$$

Now, the objective function *E* in (4.1) can be rewritten in terms of the eigenvalues $\lambda$ as in (4.4):

$$E = \sum\_{\ell=m+1}^{n} \lambda\_{\ell} \qquad \qquad \text{(4.4)}$$

where *n* is the dimension of the input space *I* and *m* is the dimension of the output space *O*. The largest eigenvalues correspond to the dimensions $1, \ldots, m$ with the largest variance. Dimensions of the input space with small variances are discarded. Thus, PCA is an orthogonal projection of the data into a lower-dimensional space. It should be noted that "PCA remains a rather basic method and suffers from many shortcomings" [Lee/Verleysen, 2007, p. 226].
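The relation between the projection cost in (4.1) and the discarded eigenvalues in (4.4) can be checked numerically. The following is a minimal sketch for the smallest possible case, a projection from two dimensions onto one, assuming a covariance matrix normalized by 1/n and using the closed-form eigenvalues of a symmetric 2x2 matrix. The function name `pca_2d_to_1d` and the toy data are illustrative assumptions and not part of the original text.

```python
import math

def pca_2d_to_1d(points):
    """Project 2-D points onto their first principal axis.

    Returns the two eigenvalues of the covariance matrix and the mean
    squared distance between the points and their projections.
    """
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(p[0] - mean_x, p[1] - mean_y) for p in points]

    # Covariance matrix [[a, b], [b, c]] with 1/n normalization (assumption).
    a = sum(x * x for x, _ in centered) / n
    b = sum(x * y for x, y in centered) / n
    c = sum(y * y for _, y in centered) / n

    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2.0
    root = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    lam1, lam2 = half_trace + root, half_trace - root

    # Unit eigenvector that belongs to the largest eigenvalue.
    if abs(b) > 1e-12:
        u = (b, lam1 - a)
    else:
        u = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(u[0], u[1])
    u = (u[0] / norm, u[1] / norm)

    # Orthogonal projection onto u and squared distance to the original point.
    error = 0.0
    for x, y in centered:
        score = x * u[0] + y * u[1]
        error += (x - score * u[0]) ** 2 + (y - score * u[1]) ** 2
    return (lam1, lam2), error / n

points = [(2.0, 1.9), (0.5, 0.7), (-1.0, -1.2), (-2.0, -1.8), (1.0, 0.6), (-0.5, -0.2)]
(lam1, lam2), mse = pca_2d_to_1d(points)
print("eigenvalues:", lam1, lam2)
print("mean squared projection cost:", mse)
```

In this setting, the mean squared projection cost coincides with the single discarded eigenvalue, which is the two-dimensional instance of (4.4).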

#### *4.1.2 Independent Component Analysis (ICA)*

*"Independent component analysis (ICA) is a method for finding underlying factors or components from multivariate (multi-dimensional) statistical data. What distinguishes ICA from other methods is that it looks for components that are both statistically independent, and nonGaussian" [Hyvärinen et al., 2004].* 

Let $I = (l\_1, \ldots, l\_n)$ be defined as the matrix of the data in the input space. ICA assumes that *I* is a linear combination of non-Gaussian independent components *S* as follows:

$$I = \mathcal{S} \* A \qquad \text{ (4.5)}$$

where *A* is a linear mixing matrix and $S = (j\_1, \ldots, j\_n)$, $j \in O$. ICA unmixes *I* by estimating a matrix $W = A^{-1}$ such that

$$I\*\mathcal{W} = \mathcal{S} \tag{4.6}$$

With the goal of estimating *W*, the central limit theorem and matrix search can be used to maximize the non-Gaussianity. In the fastICA algorithm [Hyvärinen, 1997], the non-Gaussianity is defined as the negentropy *F*, and it is approximately maximized by maximizing the objective function in (4.7)

$$E(j) \approx \left[ F\{ G(j) \} - F\{ G\{ N(m=0, \mathbf{s}=1) \} \} \right]^2 \tag{4.7}$$

where *N* is a Gaussian and *G* is a contrast function, e.g., $G(u) = -\exp\left(-\frac{u^2}{2}\right)$.

Constraints on the estimated contrast function *G* include pre-whitening and the centering of the data in the input space [Hyvärinen et al., 2004].
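To make the non-Gaussianity measure in (4.7) more tangible, the following minimal sketch evaluates it for one-dimensional samples. It assumes that the operator applied to *G* in (4.7) is read as an expectation estimated by the sample mean, that the samples are centered and scaled to unit variance first, and that the contrast function is the exponential example given above. For a standard Gaussian, the expectation of this contrast function is known in closed form, namely -1/sqrt(2). All function names are illustrative; this is not the fastICA algorithm itself, only its objective for a fixed component.

```python
import math
import random

def standardize(values):
    """Center the sample and scale it to unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def contrast(u):
    """Contrast function G(u) = -exp(-u^2 / 2)."""
    return -math.exp(-u * u / 2.0)

# Expectation of G for a standard Gaussian variable: -1 / sqrt(2).
EXPECTED_G_GAUSS = -1.0 / math.sqrt(2.0)

def non_gaussianity(values):
    """Squared difference between the sample mean of G and its Gaussian value."""
    z = standardize(values)
    mean_g = sum(contrast(v) for v in z) / len(z)
    return (mean_g - EXPECTED_G_GAUSS) ** 2

rng = random.Random(0)
gauss_sample = [rng.gauss(0.0, 1.0) for _ in range(5000)]
uniform_sample = [rng.uniform(-1.0, 1.0) for _ in range(5000)]

print("Gaussian sample:", non_gaussianity(gauss_sample))
print("uniform sample: ", non_gaussianity(uniform_sample))
```

The uniform sample yields a clearly larger value than the Gaussian sample, which is the property that an ICA algorithm exploits when it searches for the unmixing matrix.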

#### *4.1.3 Non-linear metric multidimensional scaling (MDS) techniques*

Multidimensional scaling (MDS) was originally proposed by [Torgerson, 1952]. MDS techniques attempt to preserve the pairwise distances *D(l, j)* of the input space in the output space to the greatest possible extent. Therefore, MDS techniques minimize an objective (error) function *E* that is, as given in [Kruskal, 1964b], defined as

$$E(D,d) = \sum\_{j,l=1, j < l}^{n} \left( f(D(l,j)) - d(l,j) \right)^2 \tag{4.8}$$

where *f(D(l, j))* is a *non-metric*, monotonic transformation of the distances in the input space [Kruskal, 1964a, p. 7]. *E* is often called the stress, and *E* is minimized in an attempt to reproduce the general rank ordering of the distances. This minimization is usually performed via gradient descent.

However, the objective function *E* depends on the scale on which the distances are measured. It is preferable to normalize the objective *E* to reduce it to the same units in which the distances are expressed (Eq.4.9). Sammon mapping [Sammon] is one type of MDS technique and uses the error function

$$E(D,d) = \frac{1}{\sum\_{j,l=1, j < l}^{n} D(l,j)} \sum\_{j,l=1, j < l}^{n} \frac{\left( D(l,j) - d(l,j) \right)^2}{D(l,j)} \tag{4.9}$$
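The effect of the normalization can be illustrated with a small sketch that compares an unnormalized sum of squared distance differences with Sammon's normalized stress. The sketch assumes Euclidean distances in both spaces and an identity transformation of the input distances; the function names and the toy data, in which the projection simply drops the third coordinate, are illustrative assumptions.

```python
import math

def pairwise(points):
    """Yield all index pairs (l, j) with j < l."""
    for l in range(len(points)):
        for j in range(l):
            yield l, j

def raw_stress(high, low):
    """Sum of squared differences between input and output distances."""
    return sum(
        (math.dist(high[l], high[j]) - math.dist(low[l], low[j])) ** 2
        for l, j in pairwise(high)
    )

def sammon_stress(high, low):
    """Sammon's stress: each squared difference is divided by the input
    distance, and the sum is divided by the total of all input distances."""
    norm = 0.0
    total = 0.0
    for l, j in pairwise(high):
        big_d = math.dist(high[l], high[j])
        small_d = math.dist(low[l], low[j])
        norm += big_d
        total += (big_d - small_d) ** 2 / big_d
    return total / norm

high = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0), (0.0, 2.0, 0.5), (3.0, 1.0, 2.0)]
low = [p[:2] for p in high]          # toy projection: drop the third coordinate

scale = 10.0                          # measure the same data in other units
high10 = [tuple(scale * x for x in p) for p in high]
low10 = [tuple(scale * x for x in p) for p in low]

print("raw stress:   ", raw_stress(high, low), raw_stress(high10, low10))
print("Sammon stress:", sammon_stress(high, low), sammon_stress(high10, low10))
```

Rescaling all coordinates by a factor of 10 multiplies the unnormalized sum by 100, whereas Sammon's stress stays the same up to rounding.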

#### *4.1.4 Curvilinear Component Analysis (CCA)*

When a non-linear structure is being analyzed, MDS cannot reproduce all distances. Therefore, [Demartines/Hérault] proposed a projection method that favors local neighborhoods. Curvilinear component analysis (CCA) attempts to reproduce short distances before reproducing long distances [Demartines/Hérault, 1995]. The objective function is defined in (4.10) as

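$$E(D,d) = \sum\_{j,l=1, j < l}^{n} \left( D(l,j) - d(l,j) \right)^2 \* h(D(l,j), R) \tag{4.10}$$

where $h: \mathbb{R} \to [0,1]$ is a neighborhood function that depends on a radius *R* as follows:

$$h(D(l,j),R) = \begin{cases} 1, & \text{if } D(l,j) \le R \\ 0, & \text{otherwise} \end{cases} \tag{4.11}$$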

#### *4.1.5 t-Distributed Stochastic Neighbor Embedding (t-SNE)*

The t-distributed stochastic neighbor embedding (t-SNE) technique is an enhanced version of SNE [Hinton/Roweis, 2002] in which the Kullback-Leibler divergence (KLD) is symmetrized and the crowding problem is solved. The latter is achieved by redefining the conditional probabilities in the output space *O* through the application of Student's t-distribution with

$$p(l|j) = \begin{cases} \frac{\left(1 + d(l,j)^2\right)^{-1}}{\sum\_{l,j \in O} \left(1 + d(l,j)^2\right)^{-1}}, & l \neq j \\ 0, & l = j \end{cases} \tag{4.12}$$

In [Van der Maaten/Hinton], the distance between two data points is redefined as the conditional probability that *j* would pick *l*, where $l, j \in I$, as follows:

$$P(l|j) = \begin{cases} \frac{\exp\left(-\frac{D(l,j)^2}{2\sigma(l)^2}\right)}{\sum\_{l,j \in I} \exp\left(-\frac{D(l,j)^2}{2\sigma(l)^2}\right)}, & l \neq j \\ 0, & l = j \end{cases} \tag{4.13}$$

where $\sigma(l)$ is the variance of a Gaussian that is centered on data point *j*. If the projection is correct, then the conditional probabilities will be equal [Van der Maaten/Hinton]. Therefore, the objective function is defined using the symmetric KLD in (4.14) as

$$E = \sum\_{l} \sum\_{j} \frac{P(l|j) + P(j|l)}{2n} \* \log\left(\frac{\frac{P(l|j) + P(j|l)}{2n}}{p(l|j)}\right) \tag{4.14}$$
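The interplay of the Gaussian input probabilities, their symmetrization, the Student-t output probabilities and the KLD can be sketched for a fixed projection as follows. The sketch only evaluates the objective and does not optimize it. It assumes one shared bandwidth `sigma` for all points (t-SNE itself tunes it per point via the perplexity) and follows the normalization used by van der Maaten and Hinton, in which the input probabilities are normalized per point and the output probabilities over all pairs. The function names and toy data are illustrative.

```python
import math

def conditional_p(data, sigma=1.0):
    """rows[j][l] = probability that point j picks l (Gaussian kernel)."""
    n = len(data)
    rows = []
    for j in range(n):
        weights = [
            0.0 if l == j
            else math.exp(-math.dist(data[l], data[j]) ** 2 / (2.0 * sigma ** 2))
            for l in range(n)
        ]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows

def joint_p(cond):
    """Symmetrized input probabilities (P(l|j) + P(j|l)) / (2n)."""
    n = len(cond)
    return [[(cond[j][l] + cond[l][j]) / (2.0 * n) for l in range(n)] for j in range(n)]

def joint_q(low):
    """Output probabilities based on Student's t-distribution (one degree of freedom)."""
    n = len(low)
    weights = [
        [0.0 if l == j else 1.0 / (1.0 + math.dist(low[l], low[j]) ** 2) for l in range(n)]
        for j in range(n)
    ]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

def kl_objective(p, q):
    """Kullback-Leibler divergence between input and output probabilities."""
    n = len(p)
    return sum(
        p[j][l] * math.log(p[j][l] / q[j][l])
        for j in range(n) for l in range(n) if p[j][l] > 0.0
    )

# Two well-separated toy clusters in three dimensions.
high = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.5, 0.0),
        (5.0, 5.0, 5.0), (5.5, 5.0, 5.0), (5.0, 5.5, 5.0)]
low_good = [p[:2] for p in high]          # keeps both clusters intact
low_bad = list(low_good)
low_bad[1], low_bad[4] = low_bad[4], low_bad[1]   # swaps one point per cluster

p = joint_p(conditional_p(high))
cost_good = kl_objective(p, joint_q(low_good))
cost_bad = kl_objective(p, joint_q(low_bad))
print("objective, clusters preserved:", cost_good)
print("objective, clusters disrupted:", cost_bad)
```

The projection that tears one point out of each cluster receives a clearly higher objective value than the projection that keeps the clusters together.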

#### *4.1.6 Neighborhood Retrieval Visualizer (NeRV)*

[Venna et al., 2010] reintroduced the idea of misses used by [Ultsch/Herrmann, 2005], where misses are similar data points $(l\_I, j\_I) \in I$ that are mapped onto far-separated points $(l\_O, j\_O) \in O$ [Ultsch/Herrmann, 2005]. Conversely, if a pair of closely neighboring positions $(l\_O, j\_O)$ represents a pair of distant data points, then this pair is called a false positive. From the information retrieval perspective, this approach allows one to define the precision $F\_P$ and the recall $F\_R$ for the case in which the neighborhoods are simply binary. However, [Venna et al., 2010] goes a step further by replacing such binary neighborhoods with probabilistic ones, which are loosely inspired by the SNE approach [Hinton/Roweis, 2002]. The neighborhood of the point *l* is defined in terms of the relevance of the points $j \in I$ around *l*:

$$p\_l(j) = \frac{\exp\left(-\frac{D(l,j)^2}{\sigma\_l^2}\right)}{\sum\_{k \neq l} \exp\left(-\frac{D(l,k)^2}{\sigma\_l^2}\right)} \tag{4.15}$$

where $\sigma\_l$ is set to the value for which the entropy of $p\_l(j)$ is equal to log(*knn*), and *knn* is a rough upper limit on the number of relevant neighbors that is set by the user [Venna et al., 2010]. The authors propose a default value of 20 effective nearest neighbors. Similarly, the corresponding neighborhood in the output space is defined as

$$q\_l(j) = \frac{\exp\left(-\frac{d(l,j)^2}{\sigma\_l^2}\right)}{\sum\_{k \neq l} \exp\left(-\frac{d(l,k)^2}{\sigma\_l^2}\right)} \tag{4.16}$$

These neighborhoods are compared based on the mean of the KLD, which is used to define the precision $F\_P$ and recall $F\_R$:

$$F\_R = -\frac{1}{N} \sum\_{l}^{N} \sum\_{j \neq l} p\_j(l) \* \log\left(\frac{p\_j(l)}{q\_j(l)}\right) \tag{4.17}$$

$$F\_P = -\frac{1}{N} \sum\_{l}^{N} \sum\_{j \neq l} q\_j(l) \* \log\left(\frac{q\_j(l)}{p\_j(l)}\right) \tag{4.18}$$

The objective function is then defined in (4.19) as

$$E = \lambda \sum\_{l,j} p\_j(l) \* \log\left(\frac{p\_j(l)}{q\_j(l)}\right) + (1 - \lambda) \sum\_{l,j} q\_j(l) \* \log\left(\frac{q\_j(l)}{p\_j(l)}\right) \tag{4.19}$$

The objective function *E* is non-linearly optimized via conjugate gradient descent. In the absence of prior knowledge, the neighborhoods *p* are defined as symmetric Gaussians or heavy-tailed distributions. The weighting between precision and recall must be set by the user using the parameter $\lambda$. Weighting precision over recall means that if points are similar to each other in the output space, then they will also be similar to each other in the input space, whereas weighting recall over precision means that if points are similar in the input space, then they will also be similar in the output space. Note that the KLD and the symmetric KLD do not follow the triangle inequality for metric spaces.
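The trade-off controlled by the weighting parameter can be sketched by evaluating the NeRV objective for a fixed projection. The sketch below assumes a single fixed bandwidth `sigma` for all points instead of the entropy-based choice described above, normalizes each neighborhood over all other points, and does not perform any optimization. The names `neighborhoods`, `kl_sum` and `nerv_cost` as well as the toy data are illustrative assumptions.

```python
import math

def neighborhoods(points, sigma=1.0):
    """rows[l][j] = probabilistic neighborhood of point l (Gaussian kernel)."""
    n = len(points)
    rows = []
    for l in range(n):
        weights = [
            0.0 if j == l
            else math.exp(-math.dist(points[l], points[j]) ** 2 / sigma ** 2)
            for j in range(n)
        ]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows

def kl_sum(p, q):
    """Sum over all points of the KLD between their neighborhoods."""
    n = len(p)
    return sum(
        p[l][j] * math.log(p[l][j] / q[l][j])
        for l in range(n) for j in range(n) if j != l
    )

def nerv_cost(data, low, lam, sigma=1.0):
    """Weighted sum of the two KLD terms; lam weights the term based on p."""
    p = neighborhoods(data, sigma)
    q = neighborhoods(low, sigma)
    return lam * kl_sum(p, q) + (1.0 - lam) * kl_sum(q, p)

data = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5), (0.0, 1.0, 1.0), (2.0, 2.0, 0.0), (3.0, 2.0, 1.0)]
low = [(0.0, 0.0), (1.0, 0.2), (0.3, 1.0), (2.0, 1.5), (2.5, 2.5)]

for lam in (0.0, 0.5, 1.0):
    print("lambda =", lam, "cost =", nerv_cost(data, low, lam))
```

Because the KLD is not symmetric, the two extreme settings of the weighting parameter generally rate the same projection differently, which is why the parameter has to be chosen with the intended use of the visualization in mind.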

The projection approach used in the Neighborhood Retrieval Visualizer (NeRV) method is randomly initialized by default, resulting in stochastic projections (see Figure 4.1). However, there exists an option to use PCA projection for initialization.

#### **4.2 Emergent Self-Organizing Map (ESOM)**

The self-organizing (feature) map (SOM) was invented by [Kohonen, 1982a, 1982b] and is a type of unsupervised neural learning algorithm. In contrast to other neural network models<sup>20</sup>, a SOM consists of an ordered two-dimensional layer of neurons called units. Neurons are interconnected nerve cells in the human neocortex [H. Ritter et al., 1992, p. 22], and the SOM approach was inspired by somatosensory maps (e.g., see [Hennig et al., 2015, p. 421], citing [Haykin, 1994]; see also [Kandel, 2012, p. 335]). There are two types of SOM algorithms: online and batch [Fort et al., 2001]. The first is stochastic, whereas the second is deterministic, which means that it yields reproducible results for a given parameter setting. However, Fort et al. have argued "that randomness could lead to better performances" [Fort et al., 2001, p. 12].

The main differences between batch-SOM [Kohonen/Somervuo, 2002] and online-SOM [Kohonen, 1995] lie in the updating and averaging of the input data. In batch-SOM, prototypes (see Eq. 4.20 below) are assigned to the data points and the influences of all associated data points are calculated simultaneously, in contrast to online-SOM, in which sequential training of the neurons is applied (as described in detail below). The batch-SOM method has been shown to produce topographic mappings of varying quality depending on the pre-defined parametrization [Fort et al., 2001], and "the representation of clusters in the data space on maps trained with batch learning is poor compared to sequential training" [Nöcker et al., 2006]. An important comparison between the batch-SOM approach and ant-based clustering was presented by [Herrmann/Ultsch, 2008c] and will be elaborated upon in chapter 7. No objective function is used in online-SOM [Lee/Verleysen, 2007, p. 241], and SOM remains a reference tool for two-dimensional visualization [Lee/Verleysen, 2007, p. 244].

In one common approach to applying the SOM concept, the algorithm acts as an extension of the k-means algorithm [Cottrell et al., 2016] or is a partitioning method of the k-means type [Murtagh/Hernández-Pajares, 1995]. In such a case, only a few units are used in the SOM algorithm to represent the data [Reutterer, 1998], which results in direct clustering of the data. Here, each neuron can be considered to represent a cluster. For example, Cottrell and de Bodt used 4x4 units to represent the 150 data points in the Iris data set ([Ultsch et al., 2016a] cites [Cottrell, 1996]). Therefore, the conventional SOM algorithm is called k-means-SOM here. This SOM algorithm also has two common extensions called Heskes-SOM [Heskes, 1999] and Cheng-SOM; these two extensions include objective functions [Cheng, 1997] and are not discussed further in this thesis. The optimization of objective functions in general will be discussed in chapter 6, where it will be argued that it is not useful for the goal of this thesis. Chapter 7 will show that objective functions are incompatible with self-organization.

<sup>20</sup> For an overview, see [H. Ritter et al., 1992]; for deep learning, see [Goodfellow et al., 2016].

The other approach to applying SOM is to exploit its emergent phenomena through self-organization, in which case it is necessary to use a large number of neurons (>4000) [Ultsch, 1999]. This enhancement of the online-SOM approach is called emergent SOM (ESOM). In such a case, the neurons serve as a projection of the high-dimensional input space instead of a clustering, as is the case in k-means-SOM.

Let $M = \{m\_1, \ldots, m\_n\}$ be the positions of neurons on a two-dimensional lattice<sup>21</sup> (feature map) and $W = \{w(m\_i) = w\_i \mid i = 1, \ldots, n\}$ the corresponding set of weights or prototypes of neurons; then, the SOM training algorithm constructs a non-linear and topology-preserving mapping of the input space by finding the best matching unit (BMU) for each $l \in I$:

$$bmu(l) = \underset{m\_i \in M}{\operatorname{argmin}} \{ D(l, w\_i) \}, \quad i \in \{1, \dots, n\} \tag{4.20}$$

where $D(l, w\_i)$ in Eq. 4.20 denotes a distance in the input space *I* between the point *l* and the prototype $w\_i$. In each step, SOM learning is achieved by modifying the prototypes (weights) in a neighborhood as follows:

$$
\Delta w(R) = \eta(R) \* h(bmu(l), m\_i, R) \* (l - w(m\_i))\tag{4.21}
$$

The cooling scheme is defined by the neighborhood function $h: M \times M \times \mathbb{R}^+ \to [-1,1]$ and the learning rate $\eta: \mathbb{R}^+ \to [0,1]$, where the radius *R* decreases until $R = 1$ in accordance with the definition of the maximum number of epochs. In contrast to all previously introduced projection methods, no objective function is used in the ESOM algorithm. Instead, ESOM uses the concept of self-organization (see chapter 6 for further details) to find the underlying structures in data. The structure of a (feature) map is **toroidal**; i.e., the borders of the map are cyclically connected [Ultsch, 1999], which allows the problem of neurons on borders and, consequently, boundary effects to be avoided. The positions $m \in M$ of the BMUs exhibit no structure in the input space [Ultsch, 1999]. The structure of the input data emerges only when a SOM visualization technique called U-matrix is exploited [Ultsch/Siemon, 1990].
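The BMU search and the prototype update on a toroidal map can be sketched as a single online training step. The following is a minimal sketch and not the ESOM implementation: it assumes a rectangular lattice stored as a dictionary from positions to prototypes, a Gaussian neighborhood function truncated at the radius (one possible choice of *h*), and a simple linear cooling of radius and learning rate. The map is kept far smaller than the several thousand neurons required for an emergent SOM so that the example runs instantly; all names are illustrative.

```python
import math
import random

def toroidal_distance(a, b, rows, cols):
    """Lattice distance between two positions when the map borders wrap around."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    return math.hypot(min(dr, rows - dr), min(dc, cols - dc))

def best_matching_unit(weights, point):
    """Position whose prototype is closest to the data point."""
    return min(weights, key=lambda pos: math.dist(point, weights[pos]))

def train_step(weights, point, rows, cols, radius, rate):
    """Move the prototypes around the BMU towards the data point."""
    bmu = best_matching_unit(weights, point)
    for pos in weights:
        g = toroidal_distance(bmu, pos, rows, cols)
        if g <= radius:
            # Gaussian neighborhood truncated at the radius (an assumption).
            h = math.exp(-g * g / (2.0 * radius * radius))
            weights[pos] = [w + rate * h * (x - w) for w, x in zip(weights[pos], point)]
    return bmu

rows, cols = 6, 8
rng = random.Random(1)
weights = {(r, c): [rng.random() for _ in range(3)] for r in range(rows) for c in range(cols)}
data = [[rng.random() for _ in range(3)] for _ in range(50)]

epochs = 5
for epoch in range(epochs):
    radius = max(1.0, 4.0 * (1.0 - epoch / (epochs - 1)))   # cools down to R = 1
    rate = 0.5 * (1.0 - epoch / epochs)                      # decreasing learning rate
    for point in data:
        train_step(weights, point, rows, cols, radius, rate)

print("BMU of the first data point:", best_matching_unit(weights, data[0]))
```

Because the lattice distance wraps around the borders, a neuron in a corner of the map has just as many neighbors as a neuron in the center, which is the practical benefit of the toroidal structure.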

Let $N(j)$ be the eight immediate neighbors of $m\_j \in M$, and let $w\_j \in W$ be the prototype corresponding to $m\_j$; then, the U-height $u(j)$ is the average of all distances between the prototype $w\_j$ and the prototypes $w\_i$ of its immediate neighbors:

$$u(j) = \frac{1}{n} \sum\_{i \in N(j)} D(w(m\_i), w(m\_j)), \quad n = |N(j)| \tag{4.22}$$

A display of all U-heights in Eq. 4.22 is called a U-matrix [Ultsch/Siemon, 1990].
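The U-heights can be computed directly from the prototypes of a trained map. The following minimal sketch assumes a toroidal rectangular lattice with at least three rows and three columns, Euclidean distances between prototypes, and a dictionary from lattice positions to prototypes; the function name `u_height` and the toy map are illustrative.

```python
import math

def u_height(weights, pos, rows, cols):
    """Average distance between a prototype and those of its eight neighbors."""
    r, c = pos
    distances = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr, dc) == (0, 0):
                continue
            neighbor = ((r + dr) % rows, (c + dc) % cols)   # toroidal wrap-around
            distances.append(math.dist(weights[pos], weights[neighbor]))
    return sum(distances) / len(distances)

# Toy map: prototypes in columns 0-2 lie at 0.0, those in columns 3-5 at 10.0.
rows, cols = 3, 6
weights = {
    (r, c): ([0.0] if c < 3 else [10.0])
    for r in range(rows) for c in range(cols)
}
u_matrix = [[u_height(weights, (r, c), rows, cols) for c in range(cols)] for r in range(rows)]

for row in u_matrix:
    print(row)
```

In this toy map, columns 1 and 4 lie inside a homogeneous region and obtain a U-height of 0, whereas columns 2 and 3 as well as columns 0 and 5 (which are neighbors on the torus) border the other group of prototypes and obtain a U-height of 3.75. Displayed as heights, these values form the ridges that separate the two regions.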

<sup>21</sup> In general this work uses the term grid if the resulting tiling is hexagonal and lattice if the resulting tiling is rectangular (see connected graph). In the context here the distinction is not important, therefore we use the term (feature) map.

*"By formalizing the displayed structures, [Lötsch/Ultsch, 2014] showed that the U-matrix is an approximation of the Voronoi borders of the high-dimensional points in the output space:* 

*Let bmu(l) and bmu(j) be the BMUs of data points l and j, where bmu(j) and bmu(l) have bordering Voronoi cells. On the borderline, there is a vertical plane (AU-height), which is the distance D(l, j) > 0 between the data points in the input space. In sum, the abstract U-matrix (AU-matrix) is the Delaunay graph of the BMUs weighted by the corresponding Euclidean distances in the input space" [Thrun et al., 2016a, p. 9].* 

#### *4.2.1 Visualizations of SOMs*

This section is reproduced in its entirety from [Thrun et al., 2016a]. The result of every Kohonen SOM algorithm is a set of neurons located on a map where a set *W* of prototypes corresponds to a set *M* of positions. In general, the positions on *M* are restricted to a grid/lattice, but a few approaches exist that change the positions in *M*, like Adaptive Coordinates [Merkl/Rauber, 1997]. Because these approaches are not grid/lattice based, they are not considered any further. BMUs define the locations of input points on the map. However, they exhibit no structure of the input space for a SOM [Ultsch, 1999]. However, the goal is to grasp the high-dimensional data structure and possibly even visualize cluster boundaries. Therefore, post-processing of the neurons is required for an informative representation of high-dimensional data. Three standard approaches are found in the literature:

The first approach projects the set *W* of prototypes with MDS [Torgerson, 1952] or some of its variants to a two-dimensional space [Kaski et al., 2000; Sarlin/Rönnqvist, 2013]. The result is mapped into the CIELab color space [Colorimetry, 2004]. In this uniform color space, perceptual differences in colors correspond to Euclidean distances in the map space as precisely as possible [Kaski et al., 2000]. The next two approaches visualize either the distances or density of the prototypes.

The second approach defines receptive fields around each position in *M*. The unified distance matrix (U-matrix), [Ultsch/Siemon, 1990] or one of its variants [Häkkinen/Koikkalainen, 1997; Hamel/Brown, 2011; Kraaijveld et al., 1995] , represents distances of prototypes (see equations above) by using proportional intensities of gray shades, color hues, shape or size. In [Kraaijveld et al., 1995], every neuron corresponds to a pixel. The gray value of each pixel is determined by the maximum unit distance from the neuron to its four neighbors (up, down, left, right). The larger the distance is, the lighter the gray value is. In [Häkkinen/Koikkalainen, 1997], additional unit distance visualization approaches are explained. The shapes and sizes of the receptive fields describe the dissimilarity of corresponding neurons. Apart from the U-matrix, visualizations of receptive fields in three dimensions or specific components of prototypes with receptive fields in two dimensions have been attempted [Vesanto, 1999]. It is also possible to add SOM quality measures to the receptive fields in a third dimension, e.g., [Vesanto et al., 1998].

The third approach connects the positions *M* by way of a specific scheme. In [Hamel/Brown, 2011], in addition to a U-matrix approach, neurons are connected with lines along the maximum gradient. The authors claim that clusters are the always-connected components of the graph defined by the U-matrix. [Merkl/Rauber, 1997] omitted the receptive fields approach, merely connecting map positions with lines, where the connection intensities reflect the similarity of the underlying prototypes. [K. Tasdemir/Merenyi, 2009] proposed the CONNvis technique, which visualizes the feature map by connecting neurons whose corresponding prototypes are adjacent in an input space with a dimensionality equal to that of the high-dimensional data. The width of each connection line is proportional to the strength of the connection [K. Tasdemir/Merenyi, 2009].

In sum, all of the visualizations of large SOMs described above require an expert in the field for interpretation. To the best of the present author's knowledge, there are no 3D visualizations of ESOMs based on a 2D feature map currently in use<sup>22</sup>.

#### *4.2.2 Clustering with ESOM*

Combining ESOM with the U\*-matrix approach enables an application of [Ultsch et al., 2016a]:

 *"A single wall of AU-matrix represents the true distance information between two points in the data space. Valid density information at the midpoints between a BMU and a second BMU is calculated for [the] P-matrix, since the same volumes, i.e. spheres of a predefined radius, are used. The AU\*matrix therefore represents the true distance information between two points weighted by the true density at the midpoint. The representation is such that high densities shorten the distance and low densities stretch this distance. Using transitive closure for these weighted distances allows classical clustering algorithms (AU\*clustering) to actually perform distance- and density-based clustering, taking into account the complex structure of partially entwined clusters within the data."* 

In contrast to the Databionic swarm approach, in which the shortest paths between AU-distances are calculated<sup>23</sup>, this clustering approach uses only the direct neighborhood of the projected points. A computation of the abstract P-matrix is necessary because ESOM itself does not consider density. Overlaying a political map on the U\*-matrix map reveals errors made by the ESOM algorithm during the annealing process. The political map shows the Voronoi areas of each cluster, where the color of each cluster area corresponds to the cluster label. The clustering is solid if every cluster consists of only one connected area, of which the borders are mountain ranges. The clustering process is sensitive to the parcel window parameter that is required for estimating the density of the high-dimensional data, and the clustering process is mostly conducted through an interactive approach requiring human intervention<sup>24</sup>.

#### **4.3 Types of Projection Methods**

In the previous section, it was shown that projection methods such as CCA, MDS and NeRV are characterized by an objective function that is optimized using gradient descent or a corresponding learning algorithm, whereas others, such as ESOM, are not. However, the first obvious difference between types of projection methods is that between linear projection methods such as PCA or ICA and non-linear projection methods. Linear projection methods are only able to rotate the high-dimensional data space and choose the most interesting dimensions, such as the dimensions with the highest variance, as is the case for PCA.

In contrast to this approach, non-linear projection methods are able to disentangle structures, e.g., represent the Chainlink data set<sup>25</sup> in such a way that the two clusters are separated in the output space. The next major distinction between projection methods is the deterministic versus the stochastic approach. Some projection methods will always produce the same projection in the output space if all parameters remain unchanged. However, for many projection methods, such as t-SNE, their projections in the output space will drastically change with different trials even when all settings of the projection method remain unchanged (see also examples in chapter 5, Figure 5.2). Hence, the results of deterministic methods are always reproducible, whereas stochastic methods may yield irreproducible results and require a statistical approach to assess their quality. Similarly to MDS techniques, deterministic projection methods are often based on Lyapunov functions (for further details, see [Lyapunov, 1992]). Here, it is assumed that linear and MDS techniques should only be able to visualize compact structures, which are based on the intra- versus intercluster distances of natural clusters (see the previous chapter for details).

<sup>22</sup> Standard ESOM visualizations using the U-matrix are shown in supplementary D.

<sup>23</sup> See chapter 7 for details.

<sup>24</sup> For this reason, the ESOM/U-matrix clustering approach cannot be compared with other approaches in chapter 10.

<sup>25</sup> See the next chapter for details.

Stochastic methods are mainly characterized by either a focusing approach or a self-organizing approach. Let *k* be the neighborhood extent, and let $\Gamma$ be a graph; then, a projection method is of the focusing type if the result is constructed through an iterative learning process that adapts first to global neighborhoods $H(k\_1 > 1, \Gamma, I)$ and later to local neighborhoods $H(k\_2, \Gamma, I)$, where $k\_1 > k\_2$. Therefore, such methods should be capable of visualizing connected structures (see the previous chapter for details) if the annealing process is correctly chosen.

*Self-organization* is defined as spontaneous pattern formation by a system itself, without any central control<sup>26</sup> [Kelso, 1997, p. 8 ff.]. By means of self-organization, some projection methods, such as ESOM or Pswarm, are able to project data without requiring an objective function. Thus, self-organizing methods do not implicitly predefine the structures that are sought in the data of interest. The Pswarm projection method will be introduced in chapter 8 as part of the Databionic swarm clustering approach. An overview of the various types of projection methods is shown in Figure 4.1.

Assumptions regarding the types of structures that the projection methods in Figure 4.1 are able to visualize will be either disproven or verified in chapter 10 based on 100 trials per projection method (with the exception of ICA due to technical difficulties) of five artificial three-dimensional data sets.

<sup>26</sup> Further explained in chapter 7, p.79 ff.

Figure 4.1: Overview of different types of projection methods. Here, it is argued that linear methods and MDS techniques are only able to visualize compact structures (shaded with the first pattern), whereas focusing projection methods should be able to visualize connected structures (shaded with the second pattern) if the annealing scheme is correctly chosen. For self-organizing methods, the structures that are sought in the data are not implicitly predefined. The ellipses indicate that this overview includes only common projection methods. Pswarm will be introduced in chapter 8 as a new approach based on swarm intelligence.


# **5 Visualizing the Output Space**

Projection methods are a common approach to dimensionality reduction with the aim of transforming high-dimensional data into a low-dimensional space. For data visualization purposes, projections into two dimensions are considered here. However, when the output space is limited to two dimensions, the low-dimensional similarities cannot completely represent the high-dimensional distances, which can result in a misleading interpretation of the underlying structures.

Nonetheless, visualization techniques based on scatter plots produced using a projection method (usually principal component analysis (PCA)) remain the state of the art in cluster analysis (e.g., [Everitt et al., 2001, pp. 31-32; Hennig et al., 2015, pp. 119-120, 683-684; Mirkin, 2005, p. 25; G. Ritter, 2014, p. 223]). Even if one disregards that "PCA remains a rather basic method and suffers from many shortcomings" [Lee/Verleysen, 2007, p. 226], visualization based on such a scatter plot is questionable in principle. Several two-dimensional scatter plots of elementary three-dimensional data sets and one high-dimensional data set (see also Figure 6.1 in the next chapter) will be presented to illustrate this claim.

Thereafter, structure preservation will be defined in this chapter to serve as the basis for a new method of visualization. This new concept with regard to the visualization of projected points in a two-dimensional output space is called the generalized U-matrix approach. In the generalized U-matrix approach, similarities between high-dimensional data are represented as valleys, and dissimilarities are represented as mountains or ridges. For the computation of the generalized U-matrix, the generation of the topographic map (see chapter 5.3) and the island visualization, the CRAN R package GeneralizedUmatrix was used [Thrun/Ultsch, 2017b].

# **5.1 Examples**

In Figure 5.1, the Hepta data set is shown. The Hepta data set [Moutarde/Ultsch, 2005] consists of 7 clusters that are clearly separated by distance, which means that the intracluster distances are small and the intercluster distances are large (for details, see chapter 9). This gives rise to structures that are clearly defined by discontinuity and consequently can be characterized as natural clusters.

Projections of the Hepta data set obtained by applying three of the projection methods introduced in the previous chapter are shown in Figure 5.2: PCA, curvilinear component analysis (CCA) and t-distributed stochastic neighbor embedding (t-SNE). In total, four projections are evaluated, including two t-SNE projections, denoted by *t-SNE (1)* and *t-SNE (2).* PCA yields the best representation of the clusters. With the default parameters, CCA adds excessive gaps around three points. In t-SNE (1), generated using the default parameter settings of t-SNE, the density of the data is overestimated, and wide gaps are added between two points and their corresponding cluster. When one parameter of the t-SNE algorithm is changed, resulting in t-SNE (2), the data clusters are not preserved because many random gaps are added.

Figure 5.1: The three-dimensional Hepta data set consists of 7 clusters that are clearly separated by distance. One cluster (green) has a higher density. Every cluster is ball-like in shape.

Figure 5.2: Visualizations of four cases of the projection of the Hepta data set into a two-dimensional space generated with [Thrun et al., 2017b].

> **Top left**: PCA projects the data without disrupting any clusters. This is the best-case scenario for a projection method. **Top right**: CCA disrupts two clusters by falsely projecting 3 points. This is the standard-case scenario.

> **Bottom left**: t-SNE does not correctly visualize the density of the data set at all, and one cluster is disrupted through the false projection of two points. Projection methods are often unable to correctly capture the density of data. **Bottom right**: When one parameter of the t-SNE algorithm is chosen incorrectly, all clusters are completely disrupted. This is the worst-case scenario for a projection method.

Figure 5.3: Chainlink data set and PCA projection generated with [Thrun et al., 2017]. The projection suffers from local backward projection error (BPE) and forward projection error (FPE) only in two small areas around a low number of points, but the visualization still shows low structure preservation.

The Chainlink data set [Ultsch, 2005c] consists of two clusters in $\mathbb{R}^3$. Together, both clusters form intricate links in a chain and therefore cannot be separated by linear decision boundaries. Both rings are intertwined in $\mathbb{R}^3$ and have the same average distance and density (Figure 5.3, left). The data lie on two well-separated manifolds; however, the global proximities contradict the local ones in the sense that the center of each ring is closer to some elements of the other class than it is to elements of its own class (for details, see chapter 9). PCA projection completely fails to preserve the structures in this data set because PCA merely rotates the data set and the discontinuities are not linearly separable.

#### **5.2 Structure Preservation**

Let $k > 0$, $k \in \mathbb{N}$, let $\Gamma$ be a connected graph, and let $j$ be a point in a metric space $M$; then,

$$H_j(k, \Gamma, M) = \{ l \in M \mid G(l, j, \Gamma) \le k \} \tag{5.1}$$

is the neighborhood set of $j$ with $k$ as the neighborhood extent, where $G(l, j, \Gamma)$ is the minimum distance among all possible path distances (for details, see chapter 2, Eq. 1).

Suppose that there exists a pair of similar high-dimensional data points $(l_I, j_I) \in I$ such that $(l_I, j_I) \in H(1, \Gamma, I)$. For visualization, the goal of a projection is to match these points to the low-dimensional space R; e.g., data points in close proximity should remain in close proximity, and remote data points should stay in remote positions.

Consequently, two kinds of errors exist. The first is forward projection error (FPE), which occurs when similar data points $l \in H(1, \Gamma, I)$ are mapped onto far-separated points $l \notin H(1, \Gamma, O) \wedge l \in H(k > 1, \Gamma, O)$. The second is backward projection error (BPE), which occurs when a pair of closely neighboring positions $l \in H(1, \Gamma, O)$ represents a pair of distant data points $l \notin H(1, \Gamma, I) \wedge l \in H(k > 1, \Gamma, I)$. It should be noted that similar definitions are found in [Ultsch/Herrmann, 2005], for the case of a Euclidean graph; in [Venna et al., 2010], for the case of a KNN graph of binary neighborhoods, where BPE and FPE are referred to as precision and recall; and in [Aupetit, 2007], for the case of a Delaunay graph, where BPE and FPE are referred to as manifold stretching and manifold compression.

Examples of BPE and FPE are shown in Figure 5.2. The PCA projection of the Hepta data set has a low FPE but a high BPE. The CCA projection has a very low BPE, but three points have high FPEs. The t-SNE (1) projection has a very high FPE, and for the t-SNE (2) projection, both the FPE and BPE are very high.
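The two error types can be made concrete with a minimal sketch. It assumes a KNN graph with Euclidean distances as the neighborhood definition (only one possible choice of $\Gamma$), and the function names `knn_sets` and `projection_errors` are hypothetical. Input-space neighbors that are missing from the output-space neighborhood are counted as forward errors, and output-space neighbors that are not input-space neighbors are counted as backward errors.

```python
import numpy as np

def knn_sets(points, k):
    """Return, for every point, the index set of its k nearest neighbors (Euclidean)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbor
    order = np.argsort(dist, axis=1)[:, :k]
    return [set(int(i) for i in row) for row in order]

def projection_errors(data, projected, k=1):
    """Count forward (FPE) and backward (BPE) neighborhood violations per point."""
    h_in = knn_sets(data, k)
    h_out = knn_sets(projected, k)
    fpe = np.array([len(a - b) for a, b in zip(h_in, h_out)])  # similar points torn apart
    bpe = np.array([len(b - a) for a, b in zip(h_in, h_out)])  # distant points placed together
    return fpe, bpe

# Toy example: four points in 3D forming two close pairs
data = np.array([[0.0, 0, 0], [1, 0, 0], [5, 0, 0], [6, 0, 0]])
good = data[:, :2]                                     # keeps every neighborhood
bad = np.array([[0.0, 0], [6, 0], [0.5, 0], [5, 0]])   # mixes the two pairs
fpe_good, bpe_good = projection_errors(data, good)
fpe_bad, bpe_bad = projection_errors(data, bad)
```

With a fixed k on both sides, the two counts coincide per point; they differ as soon as the neighborhoods are defined by a distance threshold or another graph.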

However, the FPE and BPE are not sufficient measures for evaluating projections if the goal is to estimate the number of clusters or to ensure a sound clustering of the data (e.g., Figure 5.3 right). In such a case, a suitable projection method should be able to preserve discontinuities, which occur in regions of the data space where the probability density function becomes very small. Discontinuities divide a dataset in the input space I into several clusters of similar elements represented by points ([Ultsch/Herrmann, 2005] used a similar definition).

In summary, the quality of structure preservation should be measured based on the preservation of high-dimensional discontinuities as gaps in the two-dimensional output space. Structure preservation refers to the preservation of input-space discontinuities such that no points are allowed to intrude into the corresponding discontinuity regions in the output space.

Let $j \in I$ be an arbitrary point, and let $I$ be projected into $O$ by the function *proj*; then, the projection method *proj* is structure-preserving for a fixed extent $k \in \mathbb{N}$ if

$$proj \colon I \to \mathcal{O}, H\_j(k, \Gamma, I) \mapsto H\_j(k, \Gamma, \mathcal{O}) \,\,\forall j \in I \tag{5.2}$$

The direct neighborhoods are preserved if

$$\forall j \in I \colon H_j(1, \Gamma, I) \cap H_j(1, \Gamma, O) \neq \emptyset \tag{5.3}$$

The BPE and FPE are acceptable if the quality of structure preservation is high (e.g., Figure 5.3). Notably, the preservation of structure critically depends on the chosen concept of similarity. For example, a multidimensional scaling (MDS) technique may be a suitable projection method if the structure preservation depends only on a Euclidean graph. This is the case for the Hepta data set. By contrast, for the Chainlink data set, a KNN graph with a suitably chosen number of nearest neighbors could yield a better result.

In Chapter 6, it will be demonstrated that many quality criteria exist for evaluating visualizations. Given the definition of structure preservation, it is possible to group these quality measures (QMs) into semantic classes based on graph theory.

In the last section of this chapter, a visualization method with the specific aim of structure preservation is proposed.

#### **5.3 Generating a Topographic Map from the Generalized U\*-matrix**

In this section, I introduce a U\*-matrix technique that is generally applicable for all projection methods and can be used to visualize both distance- and density-based structures. This visualization technique is a further development of the idea that the U-matrix can be applied to every projection method [Ultsch/Mörchen, 2006].

In this work, the visualization technique results in a topographic 3D landscape. Here, the requirements are a heavily modified emergent self-organizing map (ESOM) algorithm and a method of high-dimensional density estimation. Contrary to [Ultsch/Mörchen, 2006], the process of computing the resulting topographic map is completely free of parameter dependence and accessible simply by downloading the corresponding R package [Thrun/Ultsch, 2017b].

#### *5.3.1 Simplified ESOM*

To calculate a U\*-matrix for any projection method, a modified ESOM algorithm is required. The first step is the computation of the correct lattice size.

On the x axis, let the lattice begin at 1 and end at a maximal number denoted by Columns C (equal to the number of columns in the lattice); similarly, on the y axis, let the lattice begin at a maximal number denoted by Lines L and end at 1. Then, the first condition is expressed as [Ultsch, 2015]

$$\frac{L-1}{C-1} \approx \frac{|\max(y) - \min(y)|}{|\max(x) - \min(x)|} = \frac{dy}{dx} = \Delta \tag{I.}$$

The second condition is that the lattice size should be larger than NN<sup>27</sup>:

$$L \cdot C \ge \text{NN} \tag{II.}$$

The first condition (I.) implies that the lattice size should be as close to equal to the size of the coordinate system as possible. The second condition (II.) is required for emergence in our algorithm. For details, see [Ultsch, 1999]. The resulting equation to be solved is

$$L^2 + L(1+\Delta) - \text{NN} \* \Delta \ge 0\tag{5.4}$$

which yields

$$L \ge -\frac{1+\Delta}{2} + \sqrt{\left(\frac{1+\Delta}{2}\right)^2 + \text{NN} \ast \Delta} \tag{5.5}$$
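As a rough illustration, the lattice size can be computed from the projected coordinates as follows. This is a sketch under stated assumptions: the function name `lattice_size` is hypothetical, the default NN = 4096 is taken from the minimum proposed in [Ultsch, 1999], L is rounded up from Eq. 5.5 as printed, and C is then chosen as the smallest integer that satisfies condition (II.).

```python
import math

def lattice_size(x, y, nn=4096):
    """Return (lines, columns) from Eq. 5.5 and condition (II.)."""
    # Aspect ratio of the projected coordinates, as in condition (I.)
    delta = abs(max(y) - min(y)) / abs(max(x) - min(x))
    half = (1.0 + delta) / 2.0
    # Smallest integer L that fulfils Eq. 5.5
    lines = math.ceil(-half + math.sqrt(half ** 2 + nn * delta))
    # Assumption: smallest C with L * C >= NN
    columns = math.ceil(nn / lines)
    return lines, columns

# Projected points spanning 4 units in x and 2 units in y
lines, columns = lattice_size(x=[0.0, 2.0, 4.0], y=[0.0, 1.0, 2.0])
```

Because both values are rounded up, the ratio (L-1)/(C-1) only approximates $\Delta$, which matches the approximate equality in condition (I.).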

After the transformation from the projected points<sup>28</sup> in $O$ to points on a discrete lattice, the points are called the best-matching units (BMUs) $bmu \in B \subset \mathbb{R}^2$ of the high-dimensional data points $j$, analogous to the case for general SOM algorithms, with $fgrid \colon O \to B$ mapping each projected point to its $bmu$, where *fgrid* is surjective when conditions (I.) and (II.) are met.

To develop the algorithm illustrated in Listing 5.1, the idea of [Ultsch/Mörchen, 2006], in which it was suggested to "apply Self-Organizing Map training without changing the best match[ing unit] assignment", was adopted. However, in contrast to [Ultsch/Mörchen, 2006], here, the transformation *fgrid* is defined precisely to calculate the BMU positions and the structure of the lattice is toroidal; i.e., the borders of the lattice are cyclically connected [Ultsch, 1999].

Based on the relevant *symmetry considerations*<sup>29</sup>, a simplified version of ESOM (sESOM) is introduced here. No epochs or learning rate are required, because the cooling scheme is defined by a special neighborhood function $h \colon M \times M \times \mathbb{R}^{+} \to [0, 1]$.

Let $M = \{m_1, \ldots, m_n\}$ be a set of neurons (where $m_i$ are the lattice positions) with the corresponding prototype set $W = \{w_1, \ldots, w_n\}$, where dim(W)=dim(I) and #W=#M; then, the neighborhood function $h$ is defined as

$$h(j, l, R) = \begin{cases} 1 - \frac{d(j, l)^2}{\pi R^2}, & \text{if } \frac{d(j, l)^2}{\pi R^2} < 1 \\ 0, & \text{else} \end{cases} \tag{5.6}$$

<sup>27</sup> In [Ultsch, 1999], a minimum number of 4096 neurons was proposed.

<sup>28</sup> Or DataBot positions on the hexagonal grid of Pswarm (see chapter 8).

<sup>29</sup> See chapter 8 for details.

In sESOM, learning is achieved in each step by modifying the weights in a neighborhood as follows:

$$
\Delta w(R) = 1 \cdot h(bmu(j), m_l, R) \cdot (j - w(m_l)) \tag{5.7}
$$

In contrast to [Ultsch/Mörchen, 2006], the algorithm does not require any input parameters, and the resulting visualization is not a two-dimensional gray-scale map but rather a topographic map with hypsometric tints [Thrun et al., 2016a]. The entire algorithm is summarized in Listing 5.1.

```
function sESOM(B, I)
    for all bmu(j) ∈ B:
        assign the positions m ∈ M with random weightings w ∈ W on the grid
        assign to each bmu(j) = m the weighting w = j ∈ I
    end for bmu(j)
    for R = Rmax to 1 do
        for all j ∈ I:
            bmu(j) = argmin_{m ∈ M} { D(j, w(m)) }
            Δw(R, bmu(j)) = h(bmu(j), m, R) * (j − w(m))
            for all w(m) ∈ h(bmu(j), m, R):
                w´(m) = w(m) + Δw(R, bmu(j))
            end for w(m)
        end for j ∈ I
        for all bmu(j) ∈ B:
            assign to each bmu(j) = m the weighting w = j ∈ I
        end for bmu(j)
    end for R
end function
```

Listing 5.1: The sESOM pseudocode implements a stepwise iteration from the maximum radius Rmax, which is given by the lattice size (Rmax = C/6), down to 1 in decrements of one. *w´(m)* indicates that the prototype *w(m)* of neuron *m* is modified by Eq. 5.7. Additionally, the search for a new best-matching unit is still used, and these prototypes may change during one iteration. The predefined prototypes are reset to the weights of their corresponding high-dimensional data points after each iteration.
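The listing can be turned into a small runnable sketch. The code below is an illustration under stated assumptions, not the implementation in the R package: the names `sesom` and `toroid_sq_dist` are hypothetical, lattice positions are zero-based, the random initialization draws uniformly from the data range, and the lattice distance in Eq. 5.6 is taken to be the Euclidean distance on the toroidal grid.

```python
import numpy as np

def toroid_sq_dist(grid, pos, shape):
    """Squared lattice distance from every neuron to pos on a toroidal grid."""
    delta = np.abs(grid - pos)
    delta = np.minimum(delta, np.array(shape) - delta)   # wrap around the borders
    return (delta ** 2).sum(axis=-1)

def sesom(bmus, data, lines, columns, seed=0):
    """bmus: (N, 2) integer lattice positions; data: (N, d) high-dimensional points."""
    rng = np.random.default_rng(seed)
    shape = (lines, columns)
    grid = np.stack(np.meshgrid(np.arange(lines), np.arange(columns),
                                indexing="ij"), axis=-1)
    # Random prototypes everywhere, then fix the predefined prototypes
    weights = rng.uniform(data.min(), data.max(),
                          size=(lines, columns, data.shape[1]))
    weights[bmus[:, 0], bmus[:, 1]] = data
    r_max = max(int(columns / 6), 1)                     # Rmax = C/6
    for radius in range(r_max, 0, -1):
        for j in data:
            # Search for the best-matching unit of j
            dist = np.linalg.norm(weights - j, axis=-1)
            bmu = np.unravel_index(np.argmin(dist), shape)
            # Neighborhood function, Eq. 5.6
            ratio = toroid_sq_dist(grid, np.array(bmu), shape) / (np.pi * radius ** 2)
            h = np.where(ratio < 1.0, 1.0 - ratio, 0.0)
            # Weight update, Eq. 5.7
            weights += h[..., None] * (j - weights)
        # Reset the predefined prototypes after each iteration
        weights[bmus[:, 0], bmus[:, 1]] = data
    return weights

data = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 0.0]])
bmus = np.array([[0, 0], [3, 6], [5, 11]])               # hypothetical lattice positions
weights = sesom(bmus, data, lines=6, columns=12)
```

The reset step guarantees that the prototypes at the BMU positions equal the data points when the function returns, so only the prototypes in between are interpolated.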

#### *5.3.2 U\*-Matrix Calculation*

After sESOM projection, the structure of the input data emerges when a visualization technique called U-matrix is applied. A U-matrix represents a folding of the high-dimensional space in which each receptive field is called a U-height. Let *N(j)* be the eight immediate neighbors of $m_j \in M$, and let $w_j \in W$ be the prototype corresponding to $m_j$; then, the average of all distances between $w_j$ and the other prototypes $w_i$, $i \in N(j)$, is called the U-height corresponding to the position $m_j$:

$$u(j) = \frac{1}{n} \sum_{i \in N(j)} D(w_i, w_j), \qquad n = |N(j)| \tag{5.8}$$
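Eq. 5.8 translates directly into array operations. The following sketch assumes Euclidean distances between prototypes and a toroidal lattice stored as an array of shape (lines, columns, d); the function name `u_matrix` is hypothetical.

```python
import numpy as np

def u_matrix(weights):
    """U-heights (Eq. 5.8) on a toroidal lattice with eight immediate neighbors."""
    heights = np.zeros(weights.shape[:2])
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
               if (dr, dc) != (0, 0)]
    for dr, dc in offsets:
        # A cyclic shift aligns every neuron with one of its toroidal neighbors
        neighbor = np.roll(weights, shift=(dr, dc), axis=(0, 1))
        heights += np.linalg.norm(weights - neighbor, axis=-1)
    return heights / len(offsets)

# One deviating prototype on an otherwise flat 4 x 4 lattice
weights = np.zeros((4, 4, 2))
weights[1, 1] = [3.0, 4.0]
u = u_matrix(weights)
```

The deviating prototype receives the largest U-height, its eight neighbors receive a small one, and all remaining positions stay at zero.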

To explain the visualization technique for the sESOM algorithm, in this section and in section 5.3.3 below, [Thrun et al., 2016a] is cited:

*"The U-matrix is the display of values* $u(j)$ *through proportional intensities of grey shades [Ultsch, 2003a]. By formalizing the displayed structures, [Lötsch/Ultsch, 2014] showed that the U-matrix is an approximation of [the] Voronoi borders of the high-dimensional points in the output space" (see chapter 4.2.0).* 

Therefore, the generalized U-matrix can be normalized [using] the generalized abstract U-matrix.

*"In addition to the U-matrix, [Ultsch, 2003c] introduced the high-dimensional density visualization technique called P-matrix, where P-heights on top of the receptive fields are displayed. The P-height* $p(m_j)$ *for a position* $m_j$ *is a measure of the density of data points in the vicinity of* $w(m_j)$:

$$p(m_j) = |\{ i \in I \mid D(i, w(m_j)) < r,\ r > 0,\ r \in \mathbb{R} \}| \tag{5.9}$$

*The P-height is the number of data points within a hypersphere of radius r. Here, we choose the interval ϱ of the radius with* 

$$\varrho \in \left[ \operatorname{median} \{ \mathcal{C}(D) \} , \operatorname{median} \{ \mathcal{A}(D) \} \right], \tag{5.10}$$

*where D [represents] all input space distances and A(D) is the group A of distances calculated by [the] ABC analysis [Ultsch/Lötsch, 2015]. ABC analysis<sup>30</sup> tries to identify the optimum information that can be validly retrieved by using concepts developed in economic sciences. In particular, [these] concepts are used in the search for a minimum possible effort that gives the maximum yield [Ultsch/Lötsch, 2015]. The distances are divided into three disjoint subsets A, B and C, with subset A comprising [the] largest values ("outer cluster distances"), subset B comprising values where the yield equals the effort required to obtain it, and the subset C comprising [] the smallest values ("inner cluster distances"). We suggest [choosing] the specific radius r [based on] the [ratio] v of [the] inter- versus intracluster distances[,] estimated [as]* 

$$v = \frac{\max\{C(D)\}}{\min\{A(D)\}} \tag{5.11}$$

*The radius r is estimated [as]* $r = v \cdot p_{20}(D)$*, where* $p_{20}(D)$ *is [the] 20-th percentile of [the] input distances [Ultsch, 2003b]. From this starting point, the user may search interactively for the empirical Pareto percentile [that] defines the radius r (see [the] R package Umatrix).* 

*The combination of a U-matrix and a P-matrix is called [a] U\*-matrix [Ultsch et al., 2016a]. It can be formalized as [a] pointwise matrix [product]:* $U^{\ast} = U \ast F(P)$*, where F(P) is a matrix of factors f(p) that are determined through a linear function f on the P-heights p [in] the P-matrix. The function f is calculated so that f(p) = 1 if p is equal to the median and f(p) = 0 if p is equal to the 95-[th] percentile (p95) of the heights in the P-matrix. For p(j) > p95, f(p) = 0, which indicates that j is well within a cluster and results in [a height of zero] in the U\*-matrix." [Thrun et al., 2016a]*
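The P-heights and the pointwise product described in the quotation can be sketched as follows. The function names `p_matrix` and `u_star_matrix` are hypothetical, the radius is passed in directly instead of being estimated through ABC analysis, and the handling of the degenerate case in which the median equals p95 is an assumption that the source does not address.

```python
import numpy as np

def p_matrix(weights, data, radius):
    """P-heights (Eq. 5.9): number of data points within `radius` of each prototype."""
    diff = weights[:, :, None, :] - data[None, None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return (dist < radius).sum(axis=-1)

def u_star_matrix(u, p):
    """Pointwise product of U and F(P) with f(median) = 1 and f(p95) = 0."""
    med, p95 = np.median(p), np.percentile(p, 95)
    if p95 == med:                        # assumption: no density contrast, keep U
        return u.astype(float)
    factor = (p95 - p) / (p95 - med)      # linear function f of the P-heights
    factor[p > p95] = 0.0                 # well within a cluster: height zero
    return u * factor

# One prototype and two data points, one of them inside the hypersphere
p_single = p_matrix(np.zeros((1, 1, 2)), np.array([[0.0, 0.5], [3.0, 3.0]]), radius=1.0)

# Flat U-matrix combined with increasing densities
u = np.ones((2, 2))
p = np.array([[0, 2], [4, 10]])
u_star = u_star_matrix(u, p)
```

Positions with a density below the median receive a factor above one, so sparse regions are raised while dense regions are flattened.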

#### *5.3.3 Topographic Map with Hypsometric Tints*

The U\*-matrix visualization technique produces a topographic map with hypsometric tints [Thrun et al., 2016a]. Hypsometric tints are surface colors that represent ranges of elevation [Patterson/Kelso, 2004]. Here, a specific color scale is combined with contour lines.

The color scale is chosen to display various valleys, ridges and basins: blue colors indicate small distances (sea level), green and brown colors indicate middle distances (low hills), and white colors indicate large distances (high mountains covered with snow and ice). Valleys and basins represent clusters, and the watersheds of hills and mountains represent the borders between clusters (Figure 5.1 and Figure 5.4).

<sup>30</sup> For usage see CRAN R package ABCanalysis [Thrun et al. 2015].

The landscape consists of receptive fields, which correspond to certain U\*-height intervals with edges delineated by contours. This work proposes the following approach (see [Thrun et al., 2016a, p. 10]): First, the range of U\*-heights is split up into intervals, which are assigned uniformly and continuously to the color scale described above through robust normalization [Milligan/Cooper, 1988]. In the next step, the color scale is interpolated based on the corresponding CIELab color space [Colorimetry, 2004]. The largest possible contiguous areas corresponding to receptive fields in the same U\*-height intervals are outlined in black to form contours. Consequently, a receptive field corresponds to one color displayed in one particular location in the U\*-matrix visualization within a height-dependent contour. Let u(j) denote the U\*-heights, and let q01 and q99 denote the first and 99-th percentiles, respectively, of the U\*-heights; then, the robust normalization of the U\*-heights u(j) is defined by

$$u(j) = \frac{u(j) - q01}{q99 - q01} \tag{5.12}$$

The number of intervals *in* is defined by

$$\frac{1}{in} = \frac{q01}{q99} \tag{5.13}$$
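Both equations are short enough to be checked numerically. The sketch below uses hypothetical function names; it assumes that q01 is nonzero and leaves the rounding of the interval count open, since the source does not specify either point.

```python
import numpy as np

def robust_normalize(heights):
    """Robust normalization of U*-heights (Eq. 5.12)."""
    q01, q99 = np.percentile(heights, [1, 99])
    return (heights - q01) / (q99 - q01)

def number_of_intervals(heights):
    """Number of intervals from Eq. 5.13, rearranged to in = q99 / q01."""
    q01, q99 = np.percentile(heights, [1, 99])
    return q99 / q01          # assumption: q01 != 0

heights = np.arange(101.0)    # toy heights 0, 1, ..., 100
normalized = robust_normalize(heights)
intervals = number_of_intervals(heights)
```

Values below q01 become negative and values above q99 exceed one, so extreme heights no longer compress the color scale for the bulk of the data.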

The resulting visualization consists of a hierarchy of areas of different height levels represented by corresponding colors (see Figure 5.4). To the human eye, the visualization using the generalized U-matrix tool is analogous to a topographic map; therefore, one can visually interpret the presented data structures in an intuitive manner. In contrast to other SOM visualizations, e.g., [K. Tasdemir/Merenyi, 2009], this topographic map presentation enables the layman to interpret sESOM results.

The use of a toroidal map for sESOM computations necessitates a tiled landscape display in the interactive U-matrix tool [Thrun et al., 2015], which means that every receptive field is shown four times. Consequently, in the first step, the visualization consists of four adjoining images of the same U-matrix [Ultsch, 2003a] (the same is true for the U\*-matrix). To obtain the 3D landscape (island<sup>31</sup>), [Thrun et al., 2016a, p. 10] proposed cutting the tiled U\*-matrix visualization rectangularly as follows.

Let $v_{lines}$ and $v_{columns}$ be the vectors of the row and column sums, respectively, of the U\*-heights, and let $b_{lines}$ ($b_{columns}$) be the number of BMUs in the corresponding line (column) of $v_{lines}$ ($v_{columns}$); then, we define the upper border as $up = \max(v_{lines} / f(b_{lines}))$, the left border as $lb = \max(v_{columns} / f(b_{columns}))$ and the other two borders based on the length and width of the U\*-matrix, where the vector *f(b)* is the sum of $b$ and its two shifted copies $(b_n, b_1, \ldots, b_{n-1})$ and $(b_2, \ldots, b_{n+1})$ for a toroidal lattice. For better comprehensibility, see the axes in [Thrun et al., 2016a, p. 14, Fig. 1], which are defined from one to $max(Lines)$ and from one to $max(Columns)$.

<sup>31</sup> An island can also be cut interactively (or the cutting may be improved) and thus may not be rectangular.

Figure 5.4: Topographic map of the PCA projection of the Chainlink data set. The discontinuities between the clusters are misrepresented.

Figure 5.5: Zoomed-in view of the misrepresentation of the discontinuities in the PCA projection of the Chainlink data set to better visualize the BPE and FPE.

Figure 5.6: Topographic maps can depict the discontinuities in high-dimensional data sets: clusters lie in valleys and are separated by hills. However, the introduction of spurious gaps between projected points (the disruption of clusters) cannot be seen using this approach. **Top**: topographic map of CCA projection [Demartines/Hérault, 1995] of the Chainlink data set. **Middle**: topographic map of ESOM projection [Ultsch, 1999] of the Atom data set. **Bottom**: island of NeRV projection [Venna et al., 2010] of the leukemia data set. All results are trial-dependent because the projection methods are stochastic. Sometimes, the annealing scheme (in CCA or ESOM) or the random initialization process (in NeRV) fails.

#### *5.3.4 Limitations*

The generalized U\*-matrix visualization by a topographic map is capable of visualizing BPEs and FPEs. For example, this is shown in Figure 5.5. The projected points in the output space with low BPE/FPE values lie in sea regions. If the BPE/FPE around a projected point is high, then the visualization generates a mountain at this point (Figure 5.5). However, the topographic map has certain limitations (Figure 5.6). When the default parameters in CCA are used to analyze the Chainlink data set (see [Thrun et al., 2017]) or when the default ESOM parameters ([Thrun et al., 2016b]) are used to analyze the Atom data set, clusters are sometimes disrupted because additional gaps are added that cause points to intrude into the discontinuity regions between clusters.

The CCA and ESOM projections of the Chainlink and Atom data sets, respectively, in Figure 5.6 raise a further question: how should stochastic projection methods be handled when the visualization is trial-dependent? The annealing schemes used in the ESOM and CCA algorithms may be relevant here. The annealing process depends on certain parameters and may not yield structure-preserving projections, as shown in the examples in Figure 5.6. The Neighborhood Retrieval Visualizer (NeRV) projection of the leukemia data set presented in Figure 5.6 further illustrates the problem of the correct choice of parameters, which is typically very challenging. In this case, the NeRV projection is sensitive to the initialization parameters, especially to the seed used for the random number generator. In chapter 9, an additional example will be presented to demonstrate that NeRV requires the weighting between precision and recall to be correctly chosen for high-dimensional structures to be preserved.

Hence, the next chapter will focus on the search for a QM that may be able to measure structure preservation instead of attempting to visualize it.


# **6 Quality Assessments of Visualizations**

Dimensionality reduction techniques reduce the dimensions of the input space to facilitate the exploration of structures in high-dimensional data. Two general dimensionality reduction approaches exist: manifold learning and projection. Manifold learning methods attempt to find sub-spaces in which the high-dimensional distances are preserved. Usually, these sub-spaces have more than two dimensions.

It was argued in [Venna et al., 2010] that manifold learning methods are not very useful for information visualization because they are designed simply to find a manifold, and L. J. van der Maaten et al. demonstrated that they do not outperform classical principal component analysis (PCA) for real-world tasks [L. J. van der Maaten et al., 2009].

This work focuses on two-dimensional visualizations of high-dimensional data, with the intention of making the visualizations easily understandable, because it is difficult for humans to get a spatial sense of more than three dimensions. A valid visualization is possible if a projection method creates an image of the structure of high-dimensional data. The two-dimensional scatter plot remains a state-of-the-art form of visualization used in cluster analysis (e.g., [Everitt et al., 2001, pp. 31-32; Hennig et al., 2015, pp. 119-120, 683-684; Mirkin, 2005, p. 25; G. Ritter, 2014, p. 223]). Consequently, the aim here is to evaluate two-dimensional visualizations of high-dimensional data in which the structures are defined by discontinuities. In short, projection methods should preserve the structures defined by natural clusters.

However, as a consequence of limiting the output space to two dimensions, the low-dimensional similarities cannot completely represent the high-dimensional distances, which can result in a misleading interpretation of the underlying structures. These structures can be evaluated using quality measures (QMs), and the first step in the process of assessing the performance of projection methods is to assess these measures themselves. Here, the QMs are assessed based on the proposed concept of structure preservation, namely, the preservation of high-dimensional discontinuities related to compact or connected structures (see chapter 3, section 3.2.1, for details). Overall, 19 QMs will be categorized into semantic groups in this chapter, and their advantages and disadvantages will be discussed.

To date, QMs have mostly been applied to data sets such as a Swiss roll shape [L. Van der Maaten et al., 2009] [Mokbel et al., 2013], an s-shape [Yin, 2007] or a sphere [Venna et al., 2010], for which the problem lies only in the visual representation of an object that is continuous in more than two dimensions. Recently, [Gracia et al.] conducted a study on a number of QMs based on 12 real-world data sets. The research team's analysis of the QMs concentrated on the correlations between them [Gracia et al., 2014]. This study illustrates the other common evaluation approach: the use of various natural high-dimensional data sets for which prior classifications are available. However, with the exception of the classification error (CE) (see section 2), this information is not used in the evaluation of projection methods, e.g., [Bunte et al., 2012]. Moreover, it is not stated whether the classification is defined based on discontinuities or the prior knowledge of a domain expert. Whether these data sets possess discontinuities is not discussed.

Serving as an illustration of this problem, Figure 6.1 presents projections of a high-dimensional data set called the leukemia data set. In addition, above each plot in Figure 6.1, the CE for 7 nearest neighbors is provided. The leukemia data set was introduced in chapter 3, where it was shown that common clustering algorithms are unable to reproduce its prior classification. The question arises of whether the existing QMs are able to distinguish among different projections with regard to their preservation of the discontinuities in this data set (see chapter 3.3, Figure 3.6 and 3.7). As an example, Figure 6.2 shows the often used trustworthiness and discontinuity (T&D) measures [Venna/Kaski, 2001] and precision and recall measures [Venna et al., 2010] for this data set. The distinction among the six projections in terms of quality, based on these measures, is debatable.

Figure 6.1: Projections of the leukemia data set generated using common methods and the corresponding classification errors (CEs, see 6.1.1 for def.) for 7 nearest neighbors CE(k=7). The colors represent the predefined illness cluster labels. The clusters are separated by discontinuities in the high-dimensional space (see chapter 3). Emergent self-organizing map (ESOM) is the projection method that best preserves the discontinuities in this data set. The Neighborhood Retrieval Visualizer (NeRV) algorithm splits the smallest cluster into two roughly equal parts.

Figure 6.2: Trustworthiness and discontinuity (T&D) measures (def. see 6.1.13 on p. 65) and precision and recall measures (def. see 6.1.8 on p. 68) for the six projections shown in Figure 6.1 of the leukemia data set. The discontinuity is highest for Sammon mapping and NeRV (top left), as is the trustworthiness (top right). However, in the case of the trustworthiness, the outcome depends on the number of nearest neighbors considered, k; for a low value, ESOM is superior to Sammon mapping, and for a high value, principal component analysis (PCA) overtakes NeRV. In terms of the smoothed precision and recall [Venna et al., 2010], NeRV and PCA achieve the best values. Without the scatter plots in Figure 6.1, interpretation of the results of this figure is difficult.

This example illustrates that the evaluation of projections of real-world, high-dimensional data sets, and consequently the evaluation of QMs, is a challenging task. To simplify the problem, two elementary artificial three-dimensional data sets<sup>32</sup> will be used to aid in assessing QMs (results in supplement A). Both data sets are clearly defined based on the discontinuities, which some projection methods fail to project into two dimensions (see supplement A). In the second section of this chapter, the definitions of neighborhoods from the perspective of graph theory (chapter 2) will enable a deeper understanding of the various types of QMs.

In the last section, a new QM called the Delaunay classification error (DCE) will be introduced, which requires a prior classification of the data set of interest and is inspired by recent SOM research [Lötsch/Ultsch, 2014] on the structures of the U-matrix. In the previous chapter, a method that allows the U-matrix to be computed for any projection method was proposed.

#### **6.1 Common Quality Measures (QMs)**

In this section, the well-known measures for assessing the quality of projections are introduced in alphabetical order. Some QMs use the ranks of distances $R(j, l)$ instead of the actual distances $D(j, l)$ between points. In this case, the following shorthand notation will be used. Let $D(j, l)$ be an entry in the matrix $D_{N \times N}$ of the distances between all *N* points in a metric space *M*, where $j, l \in M$; then, the rank $R(D(j, l)) = y \in \{1, \ldots, n\}$ denotes the $y$-th position in the consecutive sequence of all entries of this matrix arranged in value from smallest to greatest. In short, the ranks of the distances are the relative positions of the distances, where *R* denotes the ranks of the distances in the input space and *r* denotes the ranks of the distances in the output space. Occasionally, ranks are represented by a vector in which the entries are the ranks of the distances between one specific point and all other points. Typically, the matrix or vector of ranks is normalized such that the values of its entries lie between zero and one.
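The rank notation can be illustrated with a short sketch. It assumes Euclidean distances, ranks each unordered pair of points once, breaks ties by order of appearance, and normalizes by the number of pairs; the function name `distance_ranks` is hypothetical.

```python
import numpy as np

def distance_ranks(points):
    """Rank every pairwise distance (1 = smallest) and scale the ranks to (0, 1]."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    iu = np.triu_indices(len(points), k=1)         # each unordered pair once
    order = np.argsort(dist[iu], kind="stable")
    ranks = np.empty(len(order))
    ranks[order] = np.arange(1, len(order) + 1)    # position in the sorted sequence
    rank_matrix = np.zeros_like(dist)
    rank_matrix[iu] = ranks / len(order)           # normalize to values in (0, 1]
    return rank_matrix + rank_matrix.T             # symmetric, zero diagonal

points = np.array([[0.0], [1.0], [3.0]])
R = distance_ranks(points)
```

Applying the same function to the projected points yields the output-space ranks *r*, so that rank-based QMs can compare the two matrices entry by entry.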

#### *6.1.1 Classification Error (CE)*

This type of error is often used to compare projection methods when a prior classification is given [Bunte et al., 2012; Gracia et al., 2014; L. J. van der Maaten et al., 2009; Venna et al., 2010].

Each point $l \in O$ in the output space is classified by a majority vote among its k nearest neighbors in the visualization [Venna et al., 2010], although sometimes simply the cluster of the nearest neighbor is chosen. This classification is compared with the prior classification as follows: Let $c \in C$ denote the classification of the points $j \in I$ in the input space, where $C_j(I)$ denotes a cluster of the classification in $I$. Let $l \in O$ denote the projected points in the output space that map to $I$. Let $H_j(knn, K, O)$ be the neighborhood of $j$ in a KNN graph in the output space. Then, the clusters are sorted, and the cluster with the largest number of points is chosen: If $\{ l \in H_j(knn, K, O) \mid \forall\, l_1, \ldots, l_k : |C_{c_1}(l)| < |C_{c_2}(l)| < \ldots < |C_{c_k}(l)| \}$, then $C_j(O) = \{C_{c_k}(l)\}$. The label $C_j(O)$ is then compared with $C_j(I)$. This yields the error

$$F = \frac{1}{N} \sum_{j=1}^{N} \left| C_j(O) \neq C_j(I) \right| \tag{6.1}$$

where the summand is one if the two labels differ and zero otherwise.
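A minimal sketch of the classification error follows. It assumes Euclidean nearest neighbors in the output space and integer class labels; the function name `classification_error` is hypothetical, and the default k = 7 mirrors the value used for Figure 6.1. Ties in the majority vote are resolved by the order of the neighbors, which is an arbitrary choice.

```python
import numpy as np
from collections import Counter

def classification_error(projected, labels, k=7):
    """Fraction of points whose k-NN majority label differs from the prior label."""
    diff = projected[:, None, :] - projected[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)                 # exclude the point itself
    neighbors = np.argsort(dist, axis=1)[:, :k]
    errors = 0
    for j, idx in enumerate(neighbors):
        majority = Counter(labels[idx].tolist()).most_common(1)[0][0]
        errors += int(majority != labels[j])
    return errors / len(labels)

# Two separated groups on a line; one point carries a conflicting prior label
projected = np.array([[0.0, 0], [1, 0], [2.5, 0], [10, 0], [11, 0], [12.5, 0]])
clean = np.array([0, 0, 0, 1, 1, 1])
noisy = np.array([0, 0, 0, 1, 1, 0])
ce_clean = classification_error(projected, clean, k=1)
ce_noisy = classification_error(projected, noisy, k=1)
```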

<sup>32</sup> One with compact structures, one with connected structures.

#### *6.1.2 C Measure*

The C measure is a product of the input and output spaces in terms of similarity functions [Goodhill et al., 1995]. For ease of comparison, in Eq. 6.2, the similarity function is redefined as the distance between two points. Consequently, the C measure is defined based on a Euclidean graph.

In the equation below, C is replaced with the capital letter F.

$$F = \sum\_{j} \sum\_{l} D(j, l) \cdot d(j, l) \tag{6.2}$$

A high value of the C measure indicates good neighborhood preservation. It is evident from Eq. 6.2 that F is at a maximum when the ranks of the distances in the spaces I and O are equivalent. No normalization of the F value is given.
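Eq. 6.2 can be evaluated in a few lines. The sketch below assumes Euclidean distances in both spaces and sums over all ordered pairs (j, l); the function names are hypothetical.

```python
import numpy as np

def pairwise(points):
    """Matrix of Euclidean distances between all points."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def c_measure(data, projected):
    """Eq. 6.2: sum over all pairs of input distance times output distance."""
    return float((pairwise(data) * pairwise(projected)).sum())

data = np.array([[0.0], [1.0], [3.0]])
same_order = np.array([[0.0], [1.0], [3.0]])    # distance ranks preserved
shuffled = np.array([[0.0], [3.0], [1.0]])      # same distances, different pairing
f_same = c_measure(data, same_order)
f_shuffled = c_measure(data, shuffled)
```

Both projections contain the same set of output distances, yet the value is larger when large input distances are paired with large output distances, which illustrates why F is at a maximum for equivalent ranks.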

#### *6.1.3 Two Variants of the C Measure: Minimal Path Length and Minimal Wiring*

Eq. 6.3 presents the definition of the minimal path length [Durbin/Mitchison, 1990], and Eq. 6.4 gives the definition of the minimal wiring [Mitchison, 1995]:

$$F = \sum\_{j,l} D(j,l) \cdot s(j,l) \tag{6.3}$$

$$F = \sum\_{j,l} d(j,l) \cdot s(j,l) \tag{6.4}$$

where $s(j, l)$, given in Eq. (I), defines the k nearest neighbors and is thus analogous to a KNN graph:

$$s(j, l) = \begin{cases} 1, & \text{if } j \in H(knn = 1, M) \\ 0, & \text{otherwise} \end{cases} \tag{I}$$

where in Eq. 6.3, $M = I$ to define the set of the nearest spatial neighbors in the input space $I$, and in Eq. 6.4, $M = O$ to serve the same purpose for the output space. A smaller value of the error F indicates a better projection.

#### *6.1.4 Force Approach Error*

According to the force approach concept presented in [Tejada et al., 2003], the relation between the distances ܦሺ݆, ݈ሻ and dሺj, lሻ should be constant for each pair of adjacent data points. The force approach attempts to separate data points that are projected too close to one another and to bring together those that are too scattered. In [Tejada et al., 2003], it was suggested that it is possible to improve any projection method by the following means.

First, for each pair of projected points ሺݓ, ݓሻ, the vector ݒఫ ሬሬሬሬԦ ൌ ݓ െ ݓ is calculated if ݓ is a direct neighbor of ݓ ;then, a perturbation in the direction of ݒఫ ሬሬሬሬԦ is applied. Consequently, ݓ is moved in the direction of ݒሬሬሬ ఫሬሬԦ by the fraction defined in (5a). When all points ݓ have thus been improved, a new iteration begins.

$$\Delta\_l = \frac{D(j, l) - D\_{min}}{D\_{max} - D\_{min}} - d'(j, l) \tag{6.5'}$$

Note that all distances $D(j, l)$ are normalized only once. For performance reasons, the projected points are normalized in every iteration instead of the $d(j, l)$. The error on the projected points is defined as

$$F = \frac{1}{M} \sum\_{l=1}^{N} |\Delta\_l| \tag{6.5}$$

Thus, as shown in Eq. 6.5´, the force approach error is defined with respect to a Euclidean graph, and an *F* value of zero suggests optimal neighborhood preservation, as seen from Eq. 6.5. A similar approach, referred to as point compression and point stretching, was proposed in [Aupetit, 2007], where it was used for the visualization of errors with the aid of Voronoi cells.

#### *6.1.5 König's Measure*

König's measure is a rank-based measure introduced in [König et al., 1994]:

$$F(knn) = \frac{1}{3knn\*N} \sum\_{j=1}^{N} q\_c(j, knn) \tag{6.6}$$

with $q\_c$ as in Eq. *(I)*:

$$q\_c(j, knn) = \begin{cases} 3, & \text{if } R(j, l) = r(j, l) \text{ and } l \in H\_j(knn, I) \cap H\_j(knn, O) \\ 2, & \text{if } l \in H\_j(knn, I) \cap H\_j(knn, O) \\ 1, & \text{if } l \in H\_j(knn, I) \cap H\_j(c, O), \ knn < c \\ 0, & \text{otherwise} \end{cases} \tag{I}$$

König's measure is controlled by the following parameters: a constant parameter c and a variable parameter representing the neighborhood size, $knn \in \{1, \ldots, knn \mid knn < c\}$, which must be smaller than c.

In the first case, the ranks place l in the same knn neighborhood with respect to j in both the input and output spaces. In the second case, the sequence in the neighborhood may be different, but $l \in O$ is still within the first knn ranks relative to j in the current neighborhood defined by the value of knn. In the third case, the point l lies in a larger, constant neighborhood of $H(c, O)$. The range of F is between zero and one, where a value of one indicates perfect structure preservation and a value of zero indicates poor structure preservation [König, 2000]. The parameters $knn$ and c were investigated by [Karbauskaitė/Dzemyda, 2009]. The results indicated that c does not have a strong influence on the value of F; F changes only for large knn values. Moreover, [Karbauskaitė/Dzemyda, 2009] showed that the parameter $k\_1$ influences only the magnitude of the F value, whereas the form of F(knn) remains approximately the same.

#### *6.1.6 Local Continuity Meta-Criterion (LCMC)*

The local continuity meta-criterion (LCMC) was introduced in [Chen/Buja, 2006]; note that a similar idea was independently adopted by [Akkucuk/Carroll, 2006]. Because the correlation between these two measures is very high [Gracia et al., 2014], only the LCMC is introduced here. The LCMC is defined as the average size of the overlap between neighborhoods consisting of k nearest neighbors in I and O [Chen/Buja, 2009]. For each $x \in I$ and $w \in O$, there exist corresponding sets of points in the neighborhoods $H(knn, I)$ and $H(knn, O)$, which are calculated using a given knn in a KNN graph. The overlap is measured in a pointwise manner:

$$A(j) = \left| H\_j(knn, I) \cap H\_j(knn, O) \right|, \qquad \overline{A\_{knn}} = \frac{1}{N} \sum\_{j=1}^{N} A(j) \tag{6.7'}$$

In Eq. 6.7´, a global measure is obtained by averaging all N cases [Chen/Buja, 2009]. The mean $\overline{A\_{knn}}$ is normalized with respect to *knn* because this value is the upper bound on $\overline{A\_{knn}}$. Eq. 6.7 is also adjusted by means of a baseline term representing a random neighborhood overlap, which is obtained by modeling a hypergeometric distribution with *knn* defectives out of N-1 items, from which *knn* items are drawn:

$$F(knn) = \frac{1}{knn} \overline{A\_{knn}} - \frac{knn}{N-1} \tag{6.7}$$

In contrast to the T&D measures and the mean relative rank error (MRRE; see the next section), the LCMC is calculated based on desired behavior [Lee/Verleysen, 2009]. The cited authors also showed that the LCMC can be expressed as a special case of the co-ranking matrix.
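The LCMC of Eq. 6.7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Euclidean distances in both spaces, ties broken by point index, and hypothetical helper names (`knn_sets`, `lcmc`) and toy data that are not part of the cited publications.

```python
import math


def knn_sets(points, knn):
    """For every point j, the index set of its knn nearest neighbors (j excluded)."""
    n = len(points)
    neighborhoods = []
    for j in range(n):
        # Sort all other points by distance to j; ties are broken by index (assumption).
        others = sorted((l for l in range(n) if l != j),
                        key=lambda l: (math.dist(points[j], points[l]), l))
        neighborhoods.append(set(others[:knn]))
    return neighborhoods


def lcmc(input_points, output_points, knn):
    """Mean neighborhood overlap, normalized by knn, minus the random baseline."""
    n = len(input_points)
    H_in = knn_sets(input_points, knn)    # H_j(knn, I)
    H_out = knn_sets(output_points, knn)  # H_j(knn, O)
    mean_overlap = sum(len(H_in[j] & H_out[j]) for j in range(n)) / n
    return mean_overlap / knn - knn / (n - 1)


# Hypothetical toy data: five 3-D points and two candidate 1-D projections.
data = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0),
        (7.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
faithful = [(0.0,), (1.0,), (3.0,), (7.0,), (15.0,)]
shuffled = [(3.0,), (15.0,), (0.0,), (7.0,), (1.0,)]

print(lcmc(data, faithful, knn=2))
print(lcmc(data, shuffled, knn=2))
```

For a projection that reproduces every neighborhood, the overlap term equals one, so F reaches its upper bound of 1 - knn/(N-1), which is 0.5 in this example with N = 5 and knn = 2.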

#### *6.1.7 Mean Relative Rank Error (MRRE) and the Co-ranking Matrix*

The MRRE was introduced in [Lee/Verleysen, 2007, p. 214] and is defined as follows:

$$F\_1(knn) = \frac{1}{N(knn)} \* \sum\_{j} \sum\_{l \in H(knn, O)} \frac{|R(j, l) - r(j, l)|}{R(j, l)}\tag{6.8a}$$

$$F\_2(knn) = \frac{1}{N(knn)} \* \sum\_{j} \sum\_{l \in H(knn, I)} \frac{|R(j, l) - r(j, l)|}{r(j, l)}\tag{6.8b}$$

The normalization is given by $N(knn) = N \sum\_{k=1}^{knn} \frac{|N - 2k + 1|}{k}$, which represents the worst case. There are notable similarities between the MRRE and the T&D measures: both types of measures use the ranks of the distances and KNN graphs to calculate overlaps, but, in addition to the different weightings, the MRRE also measures changes in the order of positions in a neighborhood *H(knn, I)* or *H(knn, O)*. Both position changes and intruding/extruding points are considered, but position changes are weighted more heavily than intrusion/extrusion. The MRRE (and T&D and LCMC, as well) can be abstracted using the co-ranking matrix framework as follows.

As introduced in [Lee/Verleysen, 2008], $Q = [q\_{kl}]\_{1 \le k, l \le N-1}$ is a matrix in which each element is equal to the number of pairs of points that lie in neighborhoods defined by the same or different values of knn. For example, $q = |H(i, knn, I) \cap H(k, knn, O)|$ represents the upper left block of the co-ranking matrix for a specific knn. Formally, Q is a sum of N permutation matrices; hence, $\sum\_{k=1}^{N-1} q\_{kl} = \sum\_{l=1}^{N-1} q\_{kl} = N$. It was shown in [Lee/Verleysen, 2009] that the MRRE can be rewritten as two alternative quantities characterizing a projection: $Q\_{MRRE}(K) = 1 - \frac{F\_1 + F\_2}{2}$, which the authors call the quality of the projection, and $B\_{MRRE}(K) = F\_1 - F\_2$, called the behavior (for details, see [Lee/Verleysen, 2009]).
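As an illustration of the co-ranking matrix, the sketch below counts, for every ordered pair of points, the input-space rank and the output-space rank and increments the matching matrix cell. The function names (`ranks`, `coranking_matrix`), the index-based tie-breaking, and the toy data are assumptions made for this example only.

```python
import math


def ranks(points):
    """rank[j][l] = position (1..N-1) of l among the neighbors of j, sorted by distance."""
    n = len(points)
    rank = [[0] * n for _ in range(n)]
    for j in range(n):
        order = sorted((l for l in range(n) if l != j),
                       key=lambda l: (math.dist(points[j], points[l]), l))
        for position, l in enumerate(order, start=1):
            rank[j][l] = position
    return rank


def coranking_matrix(input_points, output_points):
    """Q[k-1][l-1] = number of ordered pairs with input rank k and output rank l."""
    n = len(input_points)
    R = ranks(input_points)   # ranks in the input space I
    r = ranks(output_points)  # ranks in the output space O
    Q = [[0] * (n - 1) for _ in range(n - 1)]
    for j in range(n):
        for l in range(n):
            if j != l:
                Q[R[j][l] - 1][r[j][l] - 1] += 1
    return Q


# Hypothetical toy data: five 3-D points and two candidate 1-D projections.
data = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0),
        (7.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
faithful = [(0.0,), (1.0,), (3.0,), (7.0,), (15.0,)]
shuffled = [(3.0,), (15.0,), (0.0,), (7.0,), (1.0,)]

for row in coranking_matrix(data, shuffled):
    print(row)
```

Every row and every column of the resulting matrix sums to N, and a rank-preserving projection places all mass on the diagonal. The sum of the upper left knn-by-knn block equals the total neighborhood overlap for that knn, which is how overlap-based measures can be read off the matrix.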

#### *6.1.8 Precision and Recall*

[Venna et al., 2010] reintroduced the idea of misses used by [Ultsch/Herrmann, 2005], where misses are similar data points $(l\_I, j\_I) \in I$ that are mapped to far-separated points $(l\_O, j\_O) \in O$ [Ultsch/Herrmann, 2005]. Conversely, if a pair of closely neighboring positions $(l\_O, j\_O)$ represents a pair of distant data points, then this pair is called a false positive. From the information retrieval perspective, this approach allows one to define the precision and recall for the case in which the neighborhoods are merely binary. However, [Venna et al., 2010] goes a step further by replacing such binary neighborhoods with probabilistic ones, which are loosely inspired by stochastic neighbor embedding [Hinton/Roweis, 2002]. The neighborhood of the point l is defined with respect to the relevance of the points $j \in I$ around l:

$$p\_l(j) = \frac{\exp\left(-\frac{D(l,j)^2}{\sigma\_l^2}\right)}{\sum\_{k \neq l} \exp\left(-\frac{D(l,k)^2}{\sigma\_l^2}\right)} \tag{I}$$

where $\sigma\_l$ is set to the value for which the entropy of $p\_l(j)$ is equal to log(knn); knn is a rough upper limit on the number of relevant neighbors and is set by the user [Venna et al., 2010]. The authors propose a default value of 20 effective nearest neighbors. Similarly, the corresponding neighborhood in the output space is defined as

$$q\_l(j) = \frac{\exp\left(-\frac{d\left(l, j\right)^2}{\sigma\_l^2}\right)}{\sum\_{k \neq l} \exp\left(-\frac{d\left(l, k\right)^2}{\sigma\_l^2}\right)} \tag{II}$$

These neighborhoods are compared based on the Kullback-Leibler divergence (KLD). Applying *(I)* and *(II)*, the KLD is used to define the precision $F\_P$ and recall $F\_R$:

$$F\_R = -\frac{1}{N} \sum\_{l}^{N} \sum\_{j \neq l} p\_j(l) \log \left( \frac{p\_j(l)}{q\_j(l)} \right) \tag{6.9a}$$

$$F\_P = -\frac{1}{N} \sum\_{l}^{N} \sum\_{j \neq l} q\_j(l) \log \left( \frac{q\_j(l)}{p\_j(l)} \right) \tag{6.9b}$$

The precision and recall are plotted using a receiver operating characteristic (ROC)-like approach, in which the negative definition of the values results in the best projection method being displayed in the top right corner. The authors call this measure smoothed because it is not normalized, and they also propose a normalized version, with values lying between zero and one, based on ranks instead of distances. Note that the KLD and the symmetric KLD do not satisfy the triangle inequality for metric spaces.

#### *6.1.9 Rescaled Average Agreement Rate (RAAR)*

The average agreement rate is defined in Eq. 6.10 as

$$Q(knn) = \frac{1}{N} \sum\_{j=1}^{N} \frac{\left| H\_j(knn, I) \cap H\_j(knn, O) \right|}{knn} \tag{6.10}$$

in [Lee et al., 2014], analogously to the LCMC, using the unified co-ranking framework [Lee/Verleysen, 2008], in which the T&D, MRRE, and LCMC measures can all be summarized mathematically (for further details, see [Lee/Verleysen, 2009]). [Lee et al., 2014] argues that to enable fair comparisons or combinations of values of Q(knn) for different neighborhood sizes, the measure in Eq. 6.10 must be rescaled to

$$F(knn) = \frac{(N-1)Q(knn) - knn}{N - 1 - knn}, \quad 1 \le knn \le N - 2 \tag{6.10'}$$

This quantity is called the rescaled average agreement rate (RAAR). The values of F lie in the interval between zero and one. When F(knn) is plotted on a logarithmic knn scale, a scalar value can be obtained by calculating the area under the curve (AUC).
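The rescaling in Eq. 6.10´ and a scalar summary over all neighborhood sizes can be sketched as follows. The function names and toy data are hypothetical, and the weighting of each F(knn) by 1/knn in `raar_auc` is an assumption about how the area under the curve on a logarithmic knn scale can be approximated; it should be checked against [Lee et al., 2014] before use.

```python
import math


def knn_sets(points, knn):
    """For every point j, the index set of its knn nearest neighbors (j excluded)."""
    n = len(points)
    return [set(sorted((l for l in range(n) if l != j),
                       key=lambda l: (math.dist(points[j], points[l]), l))[:knn])
            for j in range(n)]


def average_agreement_rate(input_points, output_points, knn):
    """Q(knn) of Eq. 6.10: mean fraction of shared neighbors."""
    n = len(input_points)
    H_in = knn_sets(input_points, knn)
    H_out = knn_sets(output_points, knn)
    return sum(len(H_in[j] & H_out[j]) / knn for j in range(n)) / n


def raar(input_points, output_points, knn):
    """F(knn) of Eq. 6.10': Q(knn) rescaled so that a random overlap maps to zero."""
    n = len(input_points)
    if not 1 <= knn <= n - 2:
        raise ValueError("knn must satisfy 1 <= knn <= N - 2")
    q = average_agreement_rate(input_points, output_points, knn)
    return ((n - 1) * q - knn) / (n - 1 - knn)


def raar_auc(input_points, output_points):
    """Scalar summary: F(knn) averaged with weights 1/knn (assumed log-scale AUC)."""
    n = len(input_points)
    sizes = range(1, n - 1)  # knn = 1, ..., N - 2
    weighted = sum(raar(input_points, output_points, k) / k for k in sizes)
    return weighted / sum(1 / k for k in sizes)


# Hypothetical toy data: five 3-D points and two candidate 1-D projections.
data = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0),
        (7.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
faithful = [(0.0,), (1.0,), (3.0,), (7.0,), (15.0,)]
shuffled = [(3.0,), (15.0,), (0.0,), (7.0,), (1.0,)]

print(raar_auc(data, faithful), raar_auc(data, shuffled))
```

The 1/knn weights give small neighborhoods more influence than large ones, which matches the intent of a logarithmic knn axis.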

#### *6.1.10 Stress and the Shepard Diagram*

The original multidimensional scaling (MDS) measure has various limitations, such as difficulties with handling non-linearities (see [Shepard, 1980] for a review); moreover, the underlying metric must be Euclidean, and Sammon mapping is simply a normalized version of MDS. Therefore, only non-metric MDS is considered here. The calculated evaluation measure is known as the stress and was first introduced in [Kruskal, 1964a]. Here, the stress F is defined as shown in Eq. 6.11. The disparities $\xi\_{j,l}$ are the target values for each $d(j, l)$, meaning that if the distances in the output space achieve these values, then the ordering of the distances is preserved between the input and output spaces [Goodhill et al., 1995, pp. 8-9].

$$F = \sqrt{\frac{\sum\_{j \neq l} \left(D(j, l) - \xi\_{j, l}\right)^2}{\sum\_{j \neq l} D(j, l)^2}} \tag{6.11}$$

The input-space distances are used to define this measure based on a Euclidean graph. Several algorithms exist for calculating $\xi\_{j,l}$. [Kruskal, 1964a] himself regarded *F* as a sort of residual sum of squares. A smaller value of *F* indicates a better fit. Therefore, perfect neighborhood preservation is achieved when *F* is equal to zero [Kruskal, 1964a]. The author describes *F* in terms of percentages, where values below 5% imply good neighborhood preservation. *F* can be described as the deviation from a perfect scatter plot of the distances in I versus the distances in O. This scatter plot is known as the Shepard diagram [Shepard, 1980, Fig. 1C].

Here, the use of a density plot based on Pareto density estimation (PDE) [Ultsch, 2005b], instead of a scatter plot, is proposed. The author also proposes calculating Kendall's $\tau$ for these density plots.
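The raw material of a Shepard diagram is simply the list of distance pairs, one per pair of points, and a rank correlation can be computed directly from that list. The sketch below is an illustration only: it computes Kendall's tau in its simplest form (tau-a, without tie correction) on the distance pairs themselves rather than on a PDE density plot, and the function names and toy data are assumptions.

```python
import math
from itertools import combinations


def shepard_pairs(input_points, output_points):
    """One (D(j, l), d(j, l)) tuple per unordered pair of points."""
    return [(math.dist(input_points[j], input_points[l]),
             math.dist(output_points[j], output_points[l]))
            for j, l in combinations(range(len(input_points)), 2)]


def kendall_tau_a(pairs):
    """(concordant - discordant) / number of comparisons; ties count as neither."""
    concordant = discordant = 0
    for (D1, d1), (D2, d2) in combinations(pairs, 2):
        sign = (D1 - D2) * (d1 - d2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total = len(pairs) * (len(pairs) - 1) // 2
    return (concordant - discordant) / total


# Hypothetical toy data: five 3-D points and two candidate 1-D projections.
data = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0),
        (7.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
faithful = [(0.0,), (1.0,), (3.0,), (7.0,), (15.0,)]
shuffled = [(3.0,), (15.0,), (0.0,), (7.0,), (1.0,)]

print(kendall_tau_a(shepard_pairs(data, faithful)))
print(kendall_tau_a(shepard_pairs(data, shuffled)))
```

Plotting the first component of each pair against the second gives the scatter-plot form of the Shepard diagram; a projection that preserves the ordering of all distances yields a tau of one.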

#### *6.1.11 Topographic Product*

The topographic product [Bauer/Pawelzik, 1992] and an improved version thereof [Revuelta et al., 2004] were originally defined for neural maps, but in contrast to the quantization error [Uriarte/Martín, 2005] and the topographic error [Kiviluoto, 1996], it is possible to generalize the idea of the topographic product to all projection methods. Let the points $l \in H(knn(j), M)$ constitute the neighborhood of a point j in a metric space M defined based on a KNN graph and sorted in ascending order of knn; then,

$$q(j, knn) = \frac{d(j, l\_I)}{d(j, l\_O)} \tag{I}$$

$$Q(j, knn) = \frac{D(j, l\_I)}{D(j, l\_O)} \tag{II}$$

Q represents the distance between the point $j \in I$ and the k-th nearest neighbor $l\_I \in I$ in the input space I divided by the distance between the point $j \in I$ and the point $l\_O \in I$ corresponding to the k-th nearest neighbor in O. Now, the product of *q* and *Q* of *(I)* and *(II)* for all orders knn can be calculated in Eq. 6.12:

$$P(j,n) = \left(\prod\_{knn=1}^{n} q(j,knn) \* Q(j,knn)\right)^{\frac{1}{2n}}\tag{6.12}$$

The resulting QM is then defined as

$$F = \frac{1}{N(N-1)} \sum\_{j}^{N} \sum\_{knn}^{N-1} \log \left( P(j, knn) \right) \tag{6.12'}$$

F takes different values depending on whether the dimension of the output space is smaller than ($F < 0$), similar to ($F \approx 0$) or greater than ($F > 0$) the dimension of the input space [Revuelta et al., 2004]. Thus, in our case, F is always smaller than zero. [Revuelta et al., 2004] improved the topographic product by using the shortest-path distances in a Euclidean graph (geodesic distances) in Eq. *(I´)* and *(II´)* instead of the direct distances of Eq. *(I)* and *(II)*:

$$q(j, knn) = \frac{g(j, l\_I)}{g(j, l\_O)} \tag{I'}$$

$$Q(j, knn) = \frac{G(j, l\_I)}{G(j, l\_O)} \tag{II'}$$

#### *6.1.12 Topographic Function (TF)*

The topographic function (TF) for SOMs was introduced in [Villmann et al., 1994]. This measure operates on Voronoi tessellations [Toussaint, 1980]. The TF quantifies the identity of the Delaunay graphs in I and O [Herrmann, 2011]. This work follows the general definitions found in [Villmann et al., 1997], where the TF is defined as given in Eq. 6.13 (denoted by *F*), with $h \neq 0$ and N being the cardinality of $O$ or $I$:

$$F(h) = \frac{1}{N} \sum\_{j=1, j \in I}^{N} \phi(j, h) \qquad h \neq 0 \tag{6.13}$$

$$\phi(j, h) = \#\{\forall l \in I \colon g(l, j, \mathcal{D}) > h \land G(l, j, \mathcal{D}) = 1\}, \ h > 0 \tag{6.13a}$$

$$\phi(j, h) = \#\{\forall l \in I \colon g(l, j, \mathcal{D}) = 1 \land G(l, j, \mathcal{D}) > |h|\}, h < 0 \tag{6.13b}$$

The shortest path in the Delaunay graph of the input space between the data points $(l, j) \in I$ is denoted by $G(l, j, \mathcal{D})$, and that between the projected points $(l, j) \in O$ is denoted by $g(l, j, \mathcal{D})$. The Delaunay-graph distances *G* and *g* are equal to the number of Voronoi cells between the two points. If *h* is greater than zero, then $(l, j) \in I$ are neighbors in the input space, and if *h* is smaller than zero, then $(l, j) \in O$ are neighbors in the output space.

In Eq. 6.13a, $\phi(j, h)$ represents the number of neighbors surrounding a data point $j \in I$ at a Delaunay distance greater than *h*, with the restriction that only the projected points $l \in O$ that are located in adjacent Voronoi cells in O are considered.

The converse situation is considered in Eq. 6.13b: $\phi(j, h)$ represents the number of neighbors surrounding a projected point $j \in O$ at a Delaunay distance greater than *h*, with the restriction that only the data points $l \in I$ that are located in adjacent Voronoi cells in *I* are considered.

In summary, the shape of $F(h)$ enables a detailed discussion of the magnitude of distortions occurring in O [Bauer et al., 1999]: "Small values of h indicate that there are only local dimensional conflicts, whereas large values indicate the global character of a dimensional conflict" [Villmann et al., 1997]. [Bauer et al., 1999] proposed the following simplified equation:

$$F(h=0) = F(h=1) + F(h=-1)\tag{6.13'}$$

Here, h is equal to zero if and only if two points are neighbors in both the input space and the output space; thus, the overlap of Voronoi neighbors in I and O is required.

#### *6.1.13 Trustworthiness and Discontinuity (T&D)*

[Venna/Kaski, 2001] introduced the T&D measures, namely, trustworthiness and discontinuity. For each point j, let the points $l \in H\_j(knn, O \setminus I)$ be in the neighborhood consisting of the k nearest neighbors of the point j in the output space O, but not in the input space. Then, the T&D are defined as

$$F\_1(knn) = 1 - \frac{1}{N(knn)} \* \sum\_{j} \sum\_{l \in H\_j(knn, O \setminus I)} \left( R(j, l) - knn \right) \tag{6.14a}$$

$$F\_2(knn) = 1 - \frac{1}{N(knn)} \* \sum\_{j} \sum\_{l \in H\_j(knn, I \setminus O)} \left( r(j, l) - knn \right) \tag{6.14b}$$

where $N(knn)$ is a normalization factor that scales the values to the interval between zero and one [Kaski et al., 2003]. $F\_1$ is the trustworthiness (T), and $F\_2$ is the discontinuity (D). By counting the number of intruders, the T&D measures quantify the difference in the overlap of rank-based neighborhoods in I and O: $F\_1$ represents the number of points that are incorrectly included in the input-space neighborhood, and $F\_2$ represents the number of points that are incorrectly ejected from the input-space neighborhood.

[Venna/Kaski] claim that the trustworthiness ($F\_1$) quantifies "how far from the original neighborhood [in the input space] the new points [$l \in I$] entering the [output-space] neighborhood [$H(knn, O \setminus I)$] come" [Venna/Kaski, 2001, p. 487]. For the calculation of the T&D measures, KNN graphs must be generated for various knn values. Then, the trend of the curve can be interpreted. It is unclear how many knn values must be considered. Hence, knn values up to 25% of the total number of points are plotted. [Lee/Verleysen] showed that the T&D measures can be expressed as a special case of the co-ranking matrix [Lee/Verleysen, 2009].
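Both T&D measures can be computed from the two rank matrices. In the sketch below, the normalization $N(knn) = N \cdot knn \cdot (2N - 3knn - 1)/2$ is an assumption taken from the commonly used form of the trustworthiness formula, valid for knn < N/2; it is not stated in the text above. The function names, the index-based tie-breaking, and the toy data are likewise illustrative.

```python
import math


def ranks(points):
    """rank[j][l] = position (1..N-1) of l among the neighbors of j, sorted by distance."""
    n = len(points)
    rank = [[0] * n for _ in range(n)]
    for j in range(n):
        order = sorted((l for l in range(n) if l != j),
                       key=lambda l: (math.dist(points[j], points[l]), l))
        for position, l in enumerate(order, start=1):
            rank[j][l] = position
    return rank


def trustworthiness_discontinuity(input_points, output_points, knn):
    """Return (F1, F2) in the sense of Eq. 6.14a and Eq. 6.14b."""
    n = len(input_points)
    if not 0 < knn < n / 2:
        raise ValueError("the assumed normalization requires 0 < knn < N/2")
    R = ranks(input_points)   # ranks in the input space I
    r = ranks(output_points)  # ranks in the output space O
    norm = n * knn * (2 * n - 3 * knn - 1) / 2  # assumed N(knn)
    intrusion = extrusion = 0
    for j in range(n):
        for l in range(n):
            if l == j:
                continue
            in_input = R[j][l] <= knn
            in_output = r[j][l] <= knn
            if in_output and not in_input:   # l in H_j(knn, O \ I)
                intrusion += R[j][l] - knn
            if in_input and not in_output:   # l in H_j(knn, I \ O)
                extrusion += r[j][l] - knn
    return 1 - intrusion / norm, 1 - extrusion / norm


# Hypothetical toy data: five 3-D points and two candidate 1-D projections.
data = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0),
        (7.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
faithful = [(0.0,), (1.0,), (3.0,), (7.0,), (15.0,)]
shuffled = [(3.0,), (15.0,), (0.0,), (7.0,), (1.0,)]

print(trustworthiness_discontinuity(data, faithful, knn=2))
print(trustworthiness_discontinuity(data, shuffled, knn=2))
```

Evaluating the function for a range of knn values gives the two curves $F\_1(knn)$ and $F\_2(knn)$ whose trends are interpreted in practice.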

#### *6.1.14 U-ranking*

In [Ultsch/Herrmann, 2005], a QM based on a lattice was proposed. To generalize the idea to any projection method, one would use a graph. Let Γ be a graph, and let g(l, j, Γ) be the shortest path between the projected points $(j, l) \in O$; then, the U-distance can be generalized as

$$u(j,l) = g(l,j,\Gamma)\tag{6.15}$$

Let $(u(j, 1), \ldots, u(j, n))$ be the ascending sequence of all U-distances, as defined in Eq. 6.15, with respect to an arbitrary projected point *j*. The rank $r(j, l) = y \in \{1, \ldots, n\}$ represents the $y$-th position in the consecutive sequence of all U-distances $u(j, l)$ with respect to a projected point $l \in O$. Now, the minimal U-ranking measure can be defined as follows:

$$F(j) = \sum\_{l \in \{l \mid x\_l \in H(x\_j, I)\}} r(j, l) \tag{6.15'}$$

Considering [Lötsch/Ultsch, 2014], a good choice for $\Gamma$ is the Delaunay graph $\mathcal{D}$.

#### *6.1.15 Overall Correlations: Topological Index (TI) and Topological Correlation (TC)*

Various applications of the two correlation measures introduced below can be found in the literature.

The first type of correlation was introduced in [Siegel/Castellan, 1988] as Spearman's $\rho$ and, in the context of metric topology preservation, was renamed as the topological index (TI) in [Bezdek/Pal, 1993]; see [Bezdek/R Pal, 1995] for further details. In Eq. 6.16, we follow the definition of the TI given in [Bezdek/R Pal, 1995], with $\kappa = n(n - 1)/2$, where n is the number of distances:

$$F = 1 - \frac{6}{\kappa^3 - \kappa} \sum\_{l,l=1}^{\kappa} \left( R(j, l) - r(j, l) \right)^2 \tag{6.16}$$

The values of the TI are between zero and one, but [Goodhill et al., 1995] argued that the values of Spearman's $\rho$ depend on the dimensions of the input and output spaces. Moreover, research has indicated that the elementary Spearman's $\rho$ does not yield proper results for topology preservation [Karbauskaitė/Dzemyda, 2009].

[Handl et al., 2006] used the Pearson correlation, which is also called the topological correlation (TC) [Doherty et al., 2006]. The latter is notable because Delaunay-graph distances are used instead of Euclidean distances, as illustrated in the following equation:

$$F = \frac{1}{N} \sum\_{\mathbf{l}} \left( \mathbf{g}(\mathbf{l}, \mathbf{j}, \mathcal{D}) - \hat{\mathbf{g}}(\mathcal{D}) \ast \kappa^{-1} \right) \ast \left( \mathbf{G}(\mathbf{l}, \mathbf{j}, \mathcal{D}) - \hat{\mathbf{G}}(\mathcal{D}) \ast \kappa^{-1} \right) \tag{6.17}$$

where $\hat{g}(\mathcal{D})$ and $\hat{G}(\mathcal{D})$ are the means of the entries in the lower half of the distance matrices and $\kappa = n(n - 1)/2$, with *n* being the number of distances. The TC is preferable to the TI as a means of characterizing topology preservation because in the case of the TI, the matching of extreme distances is sufficient to yield reasonably high overall correlation values [Handl et al., 2006].

#### *6.1.16 Zrehen's Measure*

Zrehen's measure operates on the empty ball condition of Gabriel graphs [Gabriel/Sokal, 1969]. The neighborhood of each pair of projected points (l, j) in the output space is depicted using locally organized cells:

*"A pair of neighbor cells A and B is locally organized if the straight line joining their weight vectors W(A) and W(B) contains points which are closer to W(A) or W(B) than they are to any other" [Zrehen, 1993, p. 664].* 

In this work, the strong connection between the TF value $F(-1)$ and Zrehen's measure [Bauer et al., 1999] is noted, but in contrast to [Zrehen, 1993], who assumed a neural net in two dimensions with precisely defined neighborhoods, here the output-space neighborhood is generalized to a Gabriel graph representation. Furthermore, for each pair of nearest neighbors, the TF considers the neighborhood order *h* for that pair, whereas [Zrehen, 1993] counts the number of intruding points in neighborhoods of all orders *h* (for details, see the section on the TF above). In summary, if the condition $(l, j) \in H(1, Gabriel, O)$ is met, then all points $m \in I$ that lie between the corresponding points $(l, j) \in H(Gabriel, I)$ are deemed intruders and are counted. The sum of the number of intruders for all pairs of neighbors is normalized using a factor that depends only on the size and topology [Zrehen, 1993]:

$$f(j,l) = \#\{\forall k \in I \setminus \{l,j\} \colon \{l,k\} \in H\_j(Gabriel, I) \land g(l,j,Gabriel) = 1 \land G(j,k) < G(j,l)\}\tag{6.18}$$

$$F = \frac{1}{N} \* \sum\_{j,l \neq j} f(j,l)\tag{6.18'}$$

where N is the number of data points. The range of F starts at zero and extends to positive infinity, with a value of zero indicating the best possible projection.

#### **6.2 Types of Quality Measures for Assessing Structure Preservation**

In general, three types of QMs and some special cases can be identified, as shown in Figure 6.3. The first type of measure is called *compact*<sup>33</sup> because a measure of this type compares the arrangement of all given points in the metric space as expressed in terms of distance. In the literature, the term *topographic* is often used for such measures, e.g., [Goodhill et al., 1995]. These measures depend on some kind of comparison between inter- and intracluster distances. Measures in the second group are based on a neighborhood definition and, analogously to the terminology used in chapter 3, are called *connected*. These QMs rely on a type of predefined neighborhood *H* based on graph theory with a varying neighborhood extent *k*; thus, these neighborhoods are denoted by $H(k, \Gamma, M)$ (see chapter 2 for the corresponding definition). The expression *topology preservation* is often used in reference to this type of measure, e.g., [Bezdek/R Pal, 1995]. The special cases are grouped together under the term SOM-based measures. These measures, namely, the quantization error [Uriarte/Martín, 2005] and the topographic error [Kiviluoto, 1996], are not considered any further here because they require calculations of the distances between the data points in the input space and the weights of the neurons (prototypes) in the output space in an SOM. Instead of prototypes, general projection methods consider projected points, which can also refer to the positions of neurons on a lattice. Distances between spaces of unequal dimensions are not mathematically defined. A number of high-quality reviews are available on the subject of measuring SOM quality [Bauer et al., 1999; Beaton et al., 2010; Pölzlbauer, 2004].

The neighborhood-based QMs are divided into two groups, called *unidirectional measures* and *direction-based measures*. The reason for this is explained in chapter 2, section 2.2.1: two points *(j, k)* that lie in the same direct neighborhood of point l in $H(1, \mathcal{D}, M)$ may not lie in the same neighborhood $H(knn = 2, K, M)$ in the KNN graph if the distance D(l, k) is greater than the distance D(l, m) for a point m behind point j (see Figure 2.4 in chapter 2.2.1).

<sup>33</sup> Analogously to the usage of this term in chapter 3, where a compact structure is defined by inter- versus intracluster distances.

Figure 6.3: Groups of quality measures (QMs). The "Compact" group is only able to evaluate projections of compact structures (shaded with the first pattern), whereas the group of "Connected" QMs should be able to evaluate projections of connected structures (shaded with the second pattern) if the neighborhood definition is properly chosen. SOM-based measures are QMs that require weights of neurons (prototypes) and therefore are not generalizable to every projection method. Supervised methods are not considered here (see chapter 3 for details).

Abbreviations: trustworthiness and discontinuity (T&D), mean relative rank error (MRRE), local continuity meta-criterion (LCMC) and rescaled average agreement rate (RAAR).

#### *6.2.1 Theoretical Assessment of Quality Measures*

A good QM should reflect the quality of structure preservation and have the following properties:


QMs for evaluating the preservation of compact structures are easily interpretable; this is because they measure the quality of the preservation of distances. In most cases, the outcome is a single value in a specified range. However, no projection is able to completely preserve all distances or even the ranks of the distances [Drygas, 1978; Kirsch, 1978; Schmid, 1980]; here, it is argued that only the preservation of discontinuities in the distances is important. Therefore, any attempt to measure the quality of a projection by considering all distances is greatly disadvantageous. For example, the major disadvantage of the stress and the C measure is that the largest distances, which are likely associated with outliers in the data, exert the strongest influences on the F value. Moreover, the C measure does not consider gaps. Correlation measures capture only linear correlations; however, in most cases, a non-linear projection method is required for structure preservation [Verleysen et al., 2003]. Additionally, outliers resulting in extreme distances are over-weighted in all correlation approaches.

QMs of the second type, connected measures, compare only local neighborhoods *H*. For unidirectional connected QMs, it is necessary to choose the correct number of k nearest neighbors, which is a complicated problem in itself. Even worse, for the comparison of different projection methods, it may be necessary to choose different knn values for the output space if there is a need to measure structure preservation. For this reason, unidirectional QMs that result in a single value, such as König's measure [König, 2000], do not satisfy quality conditions I and II. In other approaches, e.g., MRRE and T&D, two F values are obtained for every knn, and it is necessary to plot both functions, $F\_{1/2}(knn)$. In this case, no distinction is possible between gaps and FPEs. Any further comparison of functional profiles for different projection methods is abstract and, consequently, not easily interpretable. Notably, the co-ranking matrix framework defined in [Lee/Verleysen, 2009, 2010] allows for the comparison, from a theoretical perspective, of several measures (the MRRE, T&D, and LCMC measures) based on $H(knn, K, M)$. However, no transformation of the co-ranking matrix into a single meaningful value exists [Mokbel et al., 2013], and the practical application of co-ranking matrices is controversial [Lueks et al., 2011]. With regard to the LCMC, [Chen/Buja, 2009] showed that it is statistically unstable and not smooth. Consequently, conditions I and II are not met, but the KNN graph is always calculable (IV).

The direction-based approach has the advantage that a distinction between FPEs and gaps is possible. However, an obvious disadvantage is the very high cost of calculation: ܱ ቀ݀ మቁ for a Delaunay graph and ܱሺ݊ଶሻ for a Gabriel graph [Aupetit, 2003]. [Villmann et al., 1997] attempted to solve this problem by proposing an approximation of the intrinsic dimension of [Grassberger/Procaccia, 1983]. In theory, the TF seems to be the best choice, but in the context considered here, a projection is defined as a mapping into a lower-dimensional space. In this case, the quality measure F(h) is equal to zero for h<0. It follows that F(h=0)=F(h=1)+F(h=- 1)=F(h=1). Consequently, half of the definition proves to be useless for the purpose considered here. The second problem is that the TF does not consider the input distances, apart from calculating the Delaunay graph in the input space. Thus, there is no difference between FPE and BPE, as long as no other points lie in between. Further disadvantages include numerical instability, because the Delaunay graph is sensitive to rounding errors in higher dimensions, and the fact that the Delaunay graph does not always correctly preserve neighborhoods if the intrinsic dimensionality of the data does not match the dimensionality of the output space O [Bauer et al., 1999].

Based on the classification of the QMs into semantic groups, here, one is able to identify several approaches that have not yet been considered. For example, one could develop a QM based on unit disk graphs.

#### *6.2.2 Practical Assessment of Quality Measures*

Various QMs were used to evaluate the structure preservation of projections of the Hepta and Chainlink data sets. In supplement A, it is shown that every approach used to measure the quality of projection methods is based on the preservation of discontinuities only when the discontinuities serve as a representation of compact or connected structures (directed or unidirectional). Consequently, the assessment of projections using QMs requires prior assumptions about the underlying structure of the data. If these assumptions are wrong, the QM will fail to correctly measure the projection quality. Figures 6.4 and 6.5 show the compact QM results obtained using the Shepard density plot method, introduced earlier in the chapter, for the Hepta and Chainlink data sets. It is possible to evaluate the preservation of compact structures in the Hepta data set (Figure 6.4), whereas the evaluation of the preservation of connected structures fails (Figure 6.5).

None of the QMs is fully credible. This is because none of them is able to measure structure preservation in all possible cases of the existence of discontinuities in the input space. To date, QMs have mostly been applied to data sets, such as a Swiss roll [Mokbel et al., 2013] or a sphere [Venna et al., 2010], for which the problem lies only in the visual representation of a continuous high-dimensional object. Therefore, the aim has been to measure the BPE and FPE. However, these examples show that structure preservation is more important, and if the goal is to visualize structures that can be used in clustering algorithms, higher FPEs and BPEs are sometimes necessary.

In supplement A, the simple Hepta example shows that every connected QM has difficulty capturing the quality of structure preservation. This is because such measures depend on compact structures defined by intra- versus intercluster distances (in a Euclidean graph). The Chainlink example illustrates that compact QMs are unsuccessful because each ring is closer to some points in the other cluster than it is to points in its own cluster, and therefore, the relevant structures are of the connected type. The density plots obtained using the Shepard diagram and Kendall's $\tau$ approaches are only able to capture discontinuities that can be unambiguously identified based on the intra- versus intercluster distances. This is not the case for the Chainlink data set, and consequently, these compact QMs fail for this data set. Moreover, because some connected QMs are not direction-based, even they encounter difficulties in evaluating structure preservation.

It seems that in the case of discontinuities in data and data sets that contain natural clusters, the user must make certain assumptions regarding which structures are most relevant and should be preserved. Based on this decision, the user can choose the most appropriate QM. Furthermore, the problem of trial-dependent projections, which is mostly ignored in the literature, is demonstrated in the example of the CCA projection of the Chainlink data set.

Figure 6.4: Density plots of the Shepard diagrams [Shepard, 1980] of the four projections of the Hepta data set shown in chapter 5, Figure 5.2. It is clearly apparent that PCA best preserves the structure of the data.

Figure 6.5: Density plots of the Shepard diagrams for three projections of the Chainlink data set. PCA appears to produce the best projection of the data set, but in reality, it results in the worst structure preservation (see supplement A). No clear difference between the CCA projections can be distinguished.

#### **6.3 Introducing the Delaunay Classification Error (DCE)**

On the one hand, QMs have difficulty measuring structure preservation when discontinuities exist in data sets (supplement A). On the other hand, in the case of natural clusters, discontinuities are important for cluster analysis, and projections of high-dimensional data sets should be able to visualize cluster structures accordingly. Consequently, identifying the most suitable method of evaluating projections of high-dimensional data for every case of high-dimensional discontinuities, with no available prior classification, remains an unsolved problem. However, if a prior classification of the data is known and if it represents patterns characterized by discontinuity, then these structures can be used for projection evaluation.

In chapter 5, it was shown that for every projection produced by any projection method, the generation of a U-matrix is possible. Consequently, the approach proposed herein assumes that an abstract U-matrix is available for every projection, as proven in [Lötsch/Ultsch, 2014] in the case of SOMs. Therefore, a Delaunay graph can be computed in the output space, and the edges are weighted using the high-dimensional distances in the input space.

Let $c_k \in C$ be the classification of the points $j \in I$ in the input space, where $C_k$ is a cluster of $C$ and $N = |I|$. Let $l \in O$ be the projected points in the output space that are mapped to $I$, and let $H_j(1, Del, O)$ be the direct neighborhood of $j$ in the Delaunay graph in the output space. Then, the neighboring points of $j$ are sorted using the Euclidean input-space distances between $j$ and these neighboring points $l \in H_j(1, Del, O)$:

$$\tilde{H}_j(1, Del, O, knn) = \{\, l \in H_j(1, Del, O) \mid \forall\, l_1, \ldots, l_{knn}:\ D(l_1, j) < D(l_2, j) < \cdots < D(l_{knn}, j) \,\} \tag{6.19a}$$

where the number of nearest neighbors considered is

$$knn \in \mathbb{N}, \qquad knn \le |H_j(1, Del, O)| \tag{6.19b}$$

Then, the incorrectly classified points in the neighborhood $\tilde{H}_l(1, Del, O, knn)$ of a point $l$ can be counted as follows:

$$|\bar{C}_k(l)| = \left| \{\, p \in I,\ j(p) \in O \mid \forall p,\ j(p) \in \tilde{H}_l(1, Del, O, knn) \,\land\, p \notin C_k(l) \,\} \right| \tag{6.19c}$$

Finally, the DCE measure is defined as

$$DCE = \frac{1}{N} \sum_{knn=2}^{k} \sum_{l=1}^{N} \frac{|\bar{C}_k(l)|}{|\tilde{H}_l(1, Del, O, knn)| - 1} \tag{6.19d}$$

A low DCE value indicates a structure-preserving projection. Following the discussion in [Ultsch, 2016a], the DCE can be simplified to

$$DCE = \sum_{l,j=1}^{N} HD_j(N) \cdot cc_{lj} \tag{6.19e}$$

where $HD(N) = \{1,\ 1 + \tfrac{1}{2},\ \ldots,\ 1 + \tfrac{1}{2} + \ldots + \tfrac{1}{N}\}$ is the vector of the decay function and $CC$ is an $N \times N$ matrix with the following definition. Let $NN = D * Delaunay$ be the distance matrix multiplied by the Delaunay adjacency matrix, where every element of this adjacency matrix is defined as

$$delaunay_{lj} = \begin{cases} 1, & \text{if } l \text{ and } j \text{ are connected} \\ \infty, & \text{if } l \text{ and } j \text{ are not connected} \end{cases} \tag{6.19f}$$

Let $\widetilde{NN}_{lj}$ be the matrix $NN$ with the columns sorted in ascending order; then, every element of the matrix $CC$ is defined as

$$cc_{lj} = \begin{cases} 0, & \text{if } l \text{ and } j \text{ are in the same class} \\ 1, & \text{otherwise} \end{cases} \tag{6.19g}$$

With the help of [Ultsch, 2016a], the harmonic decay function is approximately $HD(N) \approx \log(N) + 0.5772156649 + 1/(2N)$. It assigns the heaviest weights to the errors that are nearest to a given point. The range of the DCE, which is approximately $[0,\ N \cdot \sum_{i=1}^{N} \log(i) + 0.5772156649 + 1/(2i)]$, can be restricted to $[-2, 2]$ by calculating a baseline. An example of a baseline is a NeRV projection ([Venna et al., 2010]) with $\lambda = 0.5$, which means that the precision and recall are equally weighted. The relative difference can be calculated as

$$RelDiff = \frac{\mathbf{x} - \mathbf{y}}{0.5 \ast (\mathbf{x} + \mathbf{y})} \qquad \qquad \text{(6.19h)}$$

Then, the normalized DCE is defined as

$$F = RelDiff(DCE, baseline) \tag{6.19}$$

When the relative difference is used in this way, the range of values is fixed to $[-2, 2]$. A positive value indicates a lower error compared with the baseline projection, whereas a negative value indicates a higher error compared with the baseline. In addition, the use of the relative difference enables the comparison of different projection methods in a direct and statistical manner.
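The two numerical ingredients of this section can be checked with a few lines of Python. The following is a minimal sketch, not the author's implementation: it compares the exact harmonic partial sum with the approximation quoted above and evaluates the relative difference of equation (6.19h). The function names and the convention of returning 0 when both arguments are 0 are assumptions made for illustration.

```python
import math

EULER_GAMMA = 0.5772156649  # constant quoted in the text


def harmonic(n: int) -> float:
    """Exact partial sum 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def harmonic_approx(n: int) -> float:
    """Approximation log(n) + 0.5772156649 + 1/(2n) quoted in the text."""
    return math.log(n) + EULER_GAMMA + 1.0 / (2 * n)


def rel_diff(x: float, y: float) -> float:
    """Relative difference (x - y) / (0.5 * (x + y)) of equation (6.19h).

    Returning 0.0 when both values are 0 is an assumed convention.
    """
    if x == 0 and y == 0:
        return 0.0
    return (x - y) / (0.5 * (x + y))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, harmonic(n), harmonic_approx(n))
    # For non-negative inputs the relative difference stays within [-2, 2].
    print(rel_diff(5.0, 0.0), rel_diff(0.0, 5.0), rel_diff(3.0, 1.0))
```

For non-negative inputs, the extreme values -2 and 2 are reached only when one of the two arguments is zero, which is why the normalized measure is bounded regardless of the magnitude of the raw DCE.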

#### *6.3.1 Summary*

Overall, 19 QMs were reviewed in this chapter, and the most common measures used to assess the quality of projections were compared. The QMs were grouped into semantic classes with the aid of graph theory. The QMs presented in the literature require prior assumptions regarding the underlying high-dimensional structures in a data set of interest (for examples, see supplement A). Here, it is argued that for structure preservation, one must assume the presence of discontinuities in the high-dimensional data, which should correspond to gaps in their two-dimensional projection. In the case of such structures, the QMs reviewed here seemingly do not capture the important and unavoidable errors that occur in the projections because they assume certain definitions regarding which types of neighborhoods should be preserved (see supplement A).

Otherwise, an objective function could be defined using the best QM, and it would always be possible to obtain a structure-preserving two-dimensional visualization or clustering by optimizing this objective function.

Hence, a new QM is required to measure the quality of structure preservation. It must utilize information provided by a prior classification. The DCE is formulated based on the idea that an abstract U-matrix is available for every projection method, as demonstrated in [Lötsch/Ultsch, 2014] for the case of SOMs. A generalized U-matrix visualization, called the topographic map method, for any arbitrary projection method was presented in the previous chapter. The DCE allows projections to be ranked and normalized compared with a baseline and also enables statistical testing.

This work will present an alternative approach using swarm intelligence, self-organization, and the Nash equilibrium concept [Nash, 1950] from game theory, with the goal of eliminating the need for an objective function. The expectation is that novel and coherent properties that can be used for visualization and clustering will emerge from such a system. Chapter 7 will explain the relevant concepts, and chapter 8 will introduce the Pswarm projection method, which serves as part of the Databionic swarm clustering algorithm.


# **7 Behavior-based Systems in Data Science**

Many technological advances have been achieved with the help of bionics, which is defined as the application of biological methods and systems found in nature. A related, rarely discussed subfield of information technology is called databionics. *Databionics* refers to the attempt to adopt information processing techniques from nature. This chapter will discuss the imitation of natural processes (also called biomimicry [Benyus, 2002]) using swarm intelligence, which is a form of artificial intelligence (AI) [Bonabeau et al., 1999] and was introduced as a term in the context of robotics [Beni/Wang, 1989]. In the context considered here, AI may be described as a field of study that seeks to explain and emulate intelligent behavior in the form of a computational process<sup>34</sup> [Russell et al., 2003, p. 5].

Consequently, *swarm intelligence* is defined as the emergent collective behavior<sup>35</sup> of simple entities called agents<sup>36</sup> [Bonabeau et al., 1999, p. 12]. An agent is a software entity, situated<sup>37</sup> in a given environment, that is capable of flexible, autonomous action in order to meet its design objectives [Jennings et al., 1998]. In the context of swarms, the terms behavior and intelligence are used synonymously, bearing in mind that in general, the definition of intelligence is controversial [Legg/Hutter, 2007] and complex [Zhong, 2010]. The properties of swarm behavior will be explained later in this section.

*"There are […] three key concepts […] [related to agents]: situatedness, autonomy, and flexibility. Situatedness, in this context, means that the agent receives sensory input from its environment and that it can perform actions which change the environment in some way" [Jennings et al., 1998, p.8].* 

Autonomy refers to an agent's capability for independent, decentralized action, and flexibility refers to its ability to proactively respond to its environment in a "timely fashion" [Jennings et al., 1998].

Inspired by Beni's definition of intelligent robots [Beni/Wang, 1993, p. 705], here, an intelligent agent is described as one whose behavior is neither random nor predictable [Beni, 2004, p. 4]. On the one hand, "intelligent behavior is the production of something ordered, i.e., unlikely to occur: an improbable outcome" [Beni, 2004, p. 3]. On the other hand, unpredictability is not equivalent to intelligence; a roulette wheel, for example, is not intelligent [Beni, 2004, p. 3]. "It seems that somehow both unpredictability and the creation of some order are necessary to be able to speak of 'intelligence'" [Beni, 2004, p. 3]. In the context of data science, the first intelligent agents to be developed and applied were called DataBots [Ultsch, 2000a]. DataBots possess probabilistically defined movement strategies, take in food, consume food and store quantities of food. However, the question of whether DataBots themselves exhibit swarm intelligence is controversial [de Buitléir et al., 2012, p. 2], and as such, they will be separately introduced in the next section. It will be shown that in the case of swarm-organized projection (SOP) [Herrmann, 2011], DataBots do not exhibit swarm intelligence, whereas in the case of Pswarm (introduced in the next chapter), they do.

M. C. Thrun, *Projection-Based Clustering through Self-Organization and Swarm Intelligence*, https://doi.org/10.1007/978-3-658-20540-9\_7

<sup>34</sup> The author focuses on AI in the context of behavior; however, thought process and reasoning types of AI also exist, of which neural networks and Bayesian learning are representative examples.

<sup>35</sup> The term collective behavior generically denotes any behavior of agents in a system of more than one agent [Cao et al., 1997].

<sup>36</sup> See also a similar definition in [Martens et al., 2011, p. 2].

<sup>37</sup> "The word 'situated,' […] is intended to emphasize that the process of deliberation takes place in an agent that is directly connected to an environment" [Russell et al., 2003, p. 422].

Another example of the use of intelligent agents is Schelling's segregation model [Schelling, 1969, 1971]. The model consists of a lattice of square patches (tiling). Agents are located on this landscape, initially at random, with no more than one on any patch. The agents are of two different types, e.g., blue and red, and there are free patches available. Each agent has a tolerance parameter. A blue agent is "happy" when the ratio of blues to reds in its Moore neighborhood (the eight immediately adjacent patches) is above its tolerance threshold. Unhappy agents are allowed to move randomly to a new open position (white). Schelling's segregation model leads to segregation of the agents, even when individual agents have only a mild preference for living near agents of the same type. An example of the segregation process is illustrated in Figure 7.1.
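The dynamics described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Schelling's original formulation: the lattice is treated as toroidal, an agent counts as "happy" when the share of like-typed agents among its occupied Moore neighbors is at least `tolerance`, and an unhappy agent jumps to a uniformly chosen free patch. All names and default values (`schelling`, `n_empty`, `sweeps`, `seed`) are hypothetical.

```python
import random

EMPTY, BLUE, RED = 0, 1, 2


def same_type_share(grid, r, c):
    """Share of like-typed agents among occupied Moore neighbors (toroidal wrap)."""
    n = len(grid)
    me, same, occupied = grid[r][c], 0, 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            other = grid[(r + dr) % n][(c + dc) % n]
            if other != EMPTY:
                occupied += 1
                same += other == me
    return same / occupied if occupied else 1.0  # no neighbors: treated as happy


def mean_share(grid):
    """Average like-typed share over all agents; a simple segregation indicator."""
    n = len(grid)
    shares = [same_type_share(grid, r, c)
              for r in range(n) for c in range(n) if grid[r][c] != EMPTY]
    return sum(shares) / len(shares)


def schelling(n=20, n_empty=80, tolerance=0.5, sweeps=50, seed=1):
    rng = random.Random(seed)
    n_agents = n * n - n_empty
    cells = ([BLUE] * (n_agents // 2) + [RED] * (n_agents - n_agents // 2)
             + [EMPTY] * n_empty)
    rng.shuffle(cells)                      # random initial placement
    grid = [cells[i * n:(i + 1) * n] for i in range(n)]
    before = mean_share(grid)
    for _ in range(sweeps):
        moved = False
        agents = [(r, c) for r in range(n) for c in range(n) if grid[r][c] != EMPTY]
        rng.shuffle(agents)
        for r, c in agents:
            if same_type_share(grid, r, c) < tolerance:   # unhappy agent
                free = [(i, j) for i in range(n) for j in range(n)
                        if grid[i][j] == EMPTY]
                i, j = rng.choice(free)                    # jump to a random free patch
                grid[i][j], grid[r][c] = grid[r][c], EMPTY
                moved = True
        if not moved:                        # every agent is happy
            break
    return before, mean_share(grid), grid


if __name__ == "__main__":
    before, after, _ = schelling()
    print(f"mean like-typed share: {before:.2f} -> {after:.2f}")
```

The returned pair of averages gives a rough indication of how strongly the population has segregated relative to the random initial placement.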

*"Originally the model was intended to explain how racialized city ghettos might emerge from individual choices, given even slight racial biases. Some important constraints on effective segregation have been described by [Vinković/Kirman, 2006]. Segregation is greatly increased if agents are allowed to jump to any node that yields less stress, instead of neighbouring nodes only" [Herrmann, 2011, pp. 54-55].* 

Swarm behavior can be imitated based on observations of herds [Wong et al., 2014], bird flocks and fish schools [Reynolds, 1987], bats [Yang/He, 2013], or insects such as bees [Karaboga, 2005; Karaboga/Akay, 2009], ants [Deneubourg et al., 1991], fireflies [Yang, 2009], cockroaches [Havens et al., 2008], midges [Passino, 2013], glow-worms or slime moulds [Parpinelli/Lopes, 2011]. [Grosan et al.] define five main principles of swarm behavior: *Homogeneity,* meaning that every agent has the same behavior model; *Locality,* meaning that the motion of each agent is influenced only by its nearest neighbors; *Velocity Matching,* meaning that each agent attempts to match the velocity of nearby flockmates; *Collision Avoidance*, meaning that each agent avoids collisions with nearby agents; and *Flock Centering*, meaning that agents attempt to stay close to neighboring agents [Grosan et al., 2006, p. 2; Reynolds, 1987, pp. 6, 7]. Here, these definitions are given greater specificity in two respects.

First, the term *agent* is modified to the term *agents of the same type* because many swarms consist of more than one type of agent, e.g., small and large workers in the Pheidole genus of ants [Bonabeau et al., 1999, pp. 3, 4]. Second, a swarm need not necessarily move. For example, fire ants self-assemble into waterproof rafts to survive floods [Mlot et al., 2011]. The individual ants are linked together to construct such self-assemblages [Mlot et al., 2011]. Therefore, velocity matching can result in a velocity of zero.

Figure 7.1: The Schelling model of a liquid on a periodic lattice [Vinković/Kirman, 2006, Fig. 5 a]. After 225 million steps, the agents are fully segregated. The segregation requires many iterations if agents are allowed only to jump to the positions nearest to them.

If a swarm contains a sufficient number of agents, self-organization may emerge. *Self-organization* is defined as the spontaneous formation of patterns by a system itself [Kelso, 1997, p. 8 ff.], without any central control. The snowflake in Figure 7.2 serves as an example of self-organization. During self-organization, novel and coherent structures, patterns, and properties may arise [Goldstein, 1999]. This ability of a system to produce phenomena on a new, higher level is called *emergence* [Ultsch, 1999], and it is separately discussed in the next section.

"Self-organizing swarm behavior relies on four basic ingredients" [Bonabeau et al., 1999, pp. 22-25]: *positive feedback*, *negative feedback*, *amplification of fluctuations* and *multiple interactions*. The first two factors promote the creation of convenient structures and help to stabilize them. Fluctuations are defined to include errors, random movements and task switching. For swarm behavior to emerge, multiple interactions are required. Agents can communicate with each other either directly or indirectly. An example of direct communication is the dancing behavior of bees, in which a bee shares information about a food source, such as how plentiful it is and its direction and distance away [Karaboga/Akay, 2009]. Indirect communication is observed, for example, in the behavior of ants [Schneirla, 1971]. If the agents communicate only through modifications to their environment (through pheromones, for example), then this type of communication is defined as *stigmergy* [Beckers et al., 1994; Grassé, 1959].

The exact number of agents required for self-organization is unknown, but it should be not so large that it must be handled in terms of statistical averages and not so small that it can be treated as a few-body problem [Beni, 2004]. For example, 4096 neurons are required for self-organization in SOMs [Ultsch, 1999], and for the coordinated marching behavior of locusts, a minimum density of at least 73.8 locusts/m<sup>2</sup> was reported in [Buhl et al., 2006, p. 1404].

Figure 7.2: Example of self-organization: a large, 10.1x10.1 mm snow crystal [Libbrecht, 2016]. This snowflake is a spontaneous formation of a pattern by molecules of H<sub>2</sub>O.

Considering the two requirements stated above, Beni defined a swarm as a formation of cellular robots with a number exceeding 100 [Beni, 2004]. Here, consistent with [Beni, 2004], the argument is made that for self-organization<sup>38</sup>, the number of agents should be higher than 100. The two main types of swarm-based analysis discussed in data science, namely, particle swarm optimization (PSO) and ant colony optimization (ACO) [Martens et al., 2011], are distinguished by the type of communication used: PSO agents communicate directly, whereas ACO agents communicate through stigmergy. PSO methods are based on the movement strategies of particles [Kennedy/Eberhart, 1995] and typically used as population-based search algorithms [Rana et al., 2011], whereas ACO methods are applied for sorting tasks [Martens et al., 2011]. In addition to being used to solve discrete optimization problems, PSO has been used as a basis for rule-based classification models or as an optimizer within other learning algorithms [Martens et al., 2011], whereas ACO has been used primarily for supervised classification within the data mining community, e.g., in AntMiner [Martens et al., 2011]. Pseudocode for both types of algorithms and illustrative descriptions can be found in [Abraham et al., 2006].

#### **7.1 Artificial Behavior Based on DataBots**

The term DataBots refers to agents in the sense discussed here. DataBots were introduced in [Ultsch, 2000a] as the first artificial-behavior-based approach to data science. Each DataBot $b_j \in B$ has a position $i_j \in O$ and takes in food, consumes food and stores quantities of food. Quantities of food are represented by numbers in the range from 0% to 100%. All positions lie on a toroidal lattice, and each DataBot is capable of detecting a scent $\lambda$ at its current position. This approach is used to perform clustering tasks.

In [Ultsch, 2000c], each DataBot possesses an opinion, defined by one high-dimensional data point, and the DataBots are used as a projection method for a classification task. The movement of the DataBots is defined in terms of probabilities, which are computed using various movement programs called strategies, for each of the four directions (south, east, west and north) and for no movement (origin). With the use of these strategies, self-organization of the system is possible. Unlike in ACO methods, each DataBot possesses an opinion defined by a high-dimensional data point [Ultsch, 2000c]. Hence, reduction of the agents is impossible.

[Kämpf/Ultsch] suggested the use of movement strategies with a decreasing neighborhood radius. The underlying idea of the decreasing radius approach is to promote self-organization, first of a global structure and then of local structures [Kämpf/Ultsch, 2006]. In [Herrmann/Ultsch, 2008b], a set of additional strategies was defined for a subset of DataBots based on labeled data, requiring a prior classification. The authors used this approach to address a classification task by combining it with the emergent self-organizing map (ESOM) and the grayscale two-dimensional U-matrix method. The U-matrix was partitioned into clusters using an entropy-based heuristic algorithm called U\*C [Ultsch, 2006]. Here, it is assumed that the DataBots are defined similarly to their definition in [Herrmann/Ultsch, 2008b]: Let each DataBot $b_j \in B$ be an agent identified by a numerical vector $z_j \in \mathbb{R}^d$; it resides on a large, finite, two-dimensional discrete lattice that is embedded on the surface of a torus [Ultsch, 2003a]. The current position of DataBot $b_j$ is denoted by $i_j \in O$. Every DataBot $b_j = \{i_j, z_j\}$ emits a scent $\lambda$, which is detected by all other DataBots in its neighborhood.

<sup>38</sup> Beni himself only indirectly restricted systems that exhibit self-organization to those consisting of more than 100 agents [Beni, 2004].

By analyzing ant-based clustering<sup>39</sup> (ABC) [Lumer/Faieta, 1994] and the batch self-organizing map (batch-SOM) method [Kohonen/Somervuo, 2002], the local stress of an ABC projection<sup>40</sup> can be extracted [Herrmann, 2011, pp. 137-138; Herrmann/Ultsch, 2008a, p. 3; 2008c, p. 217; 2009, p. 4]: It is an upper limit on the best matching unit criterion<sup>41</sup> of batch-SOM and forms the topographic term of the Attractiveness function used in ant-based clustering. [Ultsch/Herrmann, 2010] used this mathematical stress term to define a scent as follows:

Let $D(l, j)$ be the distance between two points $x_l, x_j \in I$, let $d(l, j)$ be the corresponding distance in the output space $O$, and let $h_R: \mathbb{R} \to [0,1]$ be an arbitrary but continuous and monotonically decreasing function; then, the scent $\lambda(b_j, R): \mathbb{R}^+ \times O \to \mathbb{R}^+$ is defined as

$$\lambda(b_j, R) = \frac{\sum_{l \in I} h_R(d(j, l)) \cdot D(j, l)}{\sum_{l \in I} h_R(d(j, l))} \tag{7.1}$$

The scent $\lambda$ is the weighted sum of the distances to neighboring objects; consequently, $h_R$ "realizes a neighborhood function by means of focus" [Herrmann, 2011, p. 65]. To better distinguish this neighborhood function from the Databionic swarm, in the following chapters it will be referred to with the same capital letter $F_R = h_R$ as in [Herrmann, 2011].
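Equation (7.1) can be evaluated directly. The following minimal sketch assumes a Gaussian for the neighborhood function $h_R$ (with the radius as standard deviation), Euclidean input-space distances, and a toroidal lattice with `lines` rows and `columns` columns. The function and parameter names are hypothetical and chosen only for illustration.

```python
import math


def torus_distance(p, q, lines, columns):
    """Euclidean distance between two lattice positions with toroidal wrap-around."""
    dx = abs(p[0] - q[0])
    dy = abs(p[1] - q[1])
    return math.hypot(min(dx, lines - dx), min(dy, columns - dy))


def gaussian_neighborhood(d, radius):
    """Assumed h_R: continuous, monotonically decreasing in d, with values in (0, 1]."""
    return math.exp(-(d * d) / (2.0 * radius * radius))


def scent(j, positions, data, radius, lines, columns):
    """Scent of DataBot j as in equation (7.1): weighted input-space distances.

    The sum runs over all l in I, including l = j, which adds 0 to the
    numerator and h_R(0) = 1 to the denominator.
    """
    numerator = 0.0
    denominator = 0.0
    for l in range(len(data)):
        h = gaussian_neighborhood(
            torus_distance(positions[j], positions[l], lines, columns), radius)
        numerator += h * math.dist(data[j], data[l])
        denominator += h
    return numerator / denominator


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
    near_similar = [(0, 0), (0, 1), (5, 5)]     # similar point is the lattice neighbor
    near_dissimilar = [(0, 0), (5, 5), (0, 1)]  # dissimilar point is the lattice neighbor
    print(scent(0, near_similar, data, 1.0, 10, 10))
    print(scent(0, near_dissimilar, data, 1.0, 10, 10))
```

In the usage example, the scent of the first DataBot is weaker when its lattice neighbor is also close in the input space, which is the property exploited by the projection method of the next section.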

#### *7.1.1 Swarm-Organized Projection (SOP)*

The discussion in this section is based on the thesis of [Herrmann, 2011], which is a continuation of the work of [Herrmann, 2009; Ultsch/Herrmann, 2010]. The SOP algorithm was proposed as a self-adaptive projection method with the aim of creating a cohesive visualization of clusters [Herrmann, 2011]. The algorithm combines a DataBot approach, a scent definition derived from the above analysis of ABC, and Schelling's segregation model [Schelling, 1969]: the better (weaker) the scent $\lambda$ becomes, the happier the DataBot is. The SOP algorithm, as presented in Listing 7.1, operates on a finite data set with pairwise dissimilarities, which are usually defined as Euclidean distances [Herrmann, 2011]. The numeric vector $z_j$ associated with each DataBot $b_j$ represents a high-dimensional data point, and the cardinality of the data set $I$ is equal to the number of DataBots. The positions of the DataBots are defined on a rectangular lattice tiling (quad grid) $O$, which is typically toroidal but could also be planar, in Cartesian coordinates $i(x, y) \in O$, where the numbers of lines *L* and columns *C* must be set by the user. Every DataBot chooses between its current position and one new position. If the scent $\lambda$, which is defined by the function $F_R$, would be weaker in its new neighborhood, then the DataBot jumps to the new position. Another DataBot may already be located in the new position, but this does not affect the decision to jump.

In each iteration, all DataBots are allowed to move simultaneously [Herrmann, 2011]. An epoch ends when the following condition is met [Herrmann, 2011]: As long as the number of DataBots that want to jump exceeds an arbitrary threshold value, called a fixed point in [Herrmann, 2011], the current epoch proceeds to the next iteration. Otherwise, the next epoch starts, with a decrease in the neighborhood radius R. To ensure the convergence of the algorithm, a maximum number of iterations must be set. [Kohlhof, 2010] proposes a 5% threshold and a maximum number of 500 iterations, but in [Herrmann, 2011], no exact numbers are indicated.

<sup>39</sup> See next section for a more detailed description.

<sup>40</sup> In [Herrmann/Ultsch, 2008a] called topographic mapping.

<sup>41</sup> It "is a weighted sum of local input space distances" [Herrmann/Ultsch, 2009, p. 4].

The maximum possible distance in the map space is defined by $R_{max} = \sqrt{L^2 + C^2}$, and the algorithm ends if the smallest possible radius $R = 1$ is reached [Herrmann, 2011, p. 65]. The following contradiction should be taken into account: sometimes, a different minimal radius (e.g., *R=8* in [Herrmann, 2011, p. 118] for the gene data set, *R>1* in [Herrmann, 2011, p. 167] for the GPD194 data set) is chosen without any scientific basis other than the author's experience. In practice, the neighborhood function $F_R$ is chosen to be a Gaussian function where the mean is equal to zero and the standard deviation is equal to the radius *R*. Each possible new position is drawn from a Gaussian-shaped probability distribution (Fig 4.1) [Herrmann, 2011, p. 64]. Pseudocode for the SOP algorithm is provided in [Herrmann, 2011, p. 65], with the scent $\lambda(b_j)$ defined as in equation (1).

Previous work has revealed, based on the practical experience of the inventor [Herrmann, 2009], that SOP is almost as good as or even better than the best of its carefully parameterized competitor methods, such as curvilinear component analysis (CCA), t-distributed stochastic neighbor embedding (t-SNE) and ESOM, in terms of the 1-nearest-neighbor classification accuracy and the specially formulated dispersion measure of [Herrmann, 2011, p. 101] on several natural and artificial data sets.

```
function O = sop(I)
  for all z_i ∈ I: assign an initial random Cartesian position i(x, y) ∈ O
                   on the lattice to generate DataBots b_i ∈ B
  for R = {Rmax, …, 1} do
    m = Gaussian(R) of a Gaussian-shaped distribution: N(m(x), s) N(m(y), s)
    iteration = 0
    repeat
      for j = {1, …, n} do
        l = argmin_j(λ(b_j)) with j = {i, m} ∈ O
      end for
      iteration = iteration + 1
    until (l ∈ O fix with |{l ∈ O | l = m}| < threshold) OR (iteration ≥ i_max)
  end for
  return O
end function SOP
```

Listing 7.1: The swarm-organized projection (SOP) algorithm as described in [Herrmann, 2011, p. 65]. There are some parameters to be set by the user (e.g., *Rmax*, *threshold*, *\_max*, *i\_max*, …).
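The control flow of Listing 7.1 can be made concrete with a small Python sketch. It is an illustration under several assumptions and not the implementation of [Herrmann, 2011]: the radius decreases in integer steps from the integer part of $\sqrt{L^2 + C^2}$ down to 1, the neighborhood function is a Gaussian with standard deviation *R*, candidate positions are drawn by adding rounded Gaussian offsets to the current position, the term for the DataBot itself is left out of the scent so that an empty region of the lattice is not trivially attractive, and the default values for `threshold` and `i_max` are chosen only to keep the example fast. All identifiers are hypothetical.

```python
import math
import random


def torus_dist(p, q, lines, cols):
    """Euclidean lattice distance with toroidal wrap-around."""
    dx = abs(p[0] - q[0])
    dy = abs(p[1] - q[1])
    return math.hypot(min(dx, lines - dx), min(dy, cols - dy))


def scent(j, pos_j, positions, dist, radius, lines, cols):
    """Scent of DataBot j if it were located at pos_j (self term skipped, an assumption)."""
    num = den = 0.0
    for l, pos_l in enumerate(positions):
        if l == j:
            continue
        h = math.exp(-torus_dist(pos_j, pos_l, lines, cols) ** 2 / (2.0 * radius ** 2))
        num += h * dist[j][l]
        den += h
    return num / den if den > 0.0 else math.inf


def sop(data, lines=10, cols=10, threshold=0.05, i_max=20, seed=0):
    """Return one lattice position per data point."""
    rng = random.Random(seed)
    n = len(data)
    dist = [[math.dist(a, b) for b in data] for a in data]   # input-space distances
    positions = [(rng.randrange(lines), rng.randrange(cols)) for _ in range(n)]
    r_max = int(math.hypot(lines, cols))
    for radius in range(r_max, 0, -1):            # one epoch per radius
        for _ in range(i_max):                    # bounded number of iterations
            proposals = []
            for j in range(n):
                x, y = positions[j]
                cand = ((x + round(rng.gauss(0, radius))) % lines,
                        (y + round(rng.gauss(0, radius))) % cols)
                weaker = (scent(j, cand, positions, dist, radius, lines, cols)
                          < scent(j, positions[j], positions, dist, radius, lines, cols))
                proposals.append(cand if weaker else positions[j])
            jumps = sum(p != q for p, q in zip(proposals, positions))
            positions = proposals                 # all DataBots move simultaneously
            if jumps < threshold * n:             # few DataBots want to jump: next epoch
                break
    return positions


if __name__ == "__main__":
    rng = random.Random(42)
    cluster_a = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(15)]
    cluster_b = [(rng.gauss(5, 0.3), rng.gauss(5, 0.3)) for _ in range(15)]
    print(sop(cluster_a + cluster_b))
```

Because several DataBots may occupy the same lattice position, as stated above, the sketch performs no collision handling.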

#### **7.2 Swarm Intelligence for Unsupervised Machine Learning**

As mentioned earlier in this chapter, there are two main types of artificial swarm optimization methods: PSO and ACO. In unsupervised learning, two additional approaches are known. The first one is based on bees [Karaboga/Akay, 2009], and the second is based on foraging theory [Stephens/Krebs, 1986].

For clustering tasks, PSO has mainly been applied in hybrid algorithms [Esmin et al., 2015]; e.g., [Van der Merwe/Engelbrecht, 2003] applied PSO combined with k-means clustering. Here, it is argued that the hybridization of PSO and k-means may improve the choice of centroids or may, in some special cases, even allow the problem of the number of clusters to be solved. However, this approach is subject to several of the shortcomings of k-means, which is known to search for spherical clusters [Hennig et al., 2015, p. 721]/[Hennig, 2015a, p. 18]; i.e., it is unable to find clusters in elementary data sets, such as those in the Fundamental Clustering Problems Suite<sup>42</sup> (FCPS) [Ultsch, 2005a].

According to [Rana et al., 2011], the advantages of the clustering process when the PSO approach is used are that it is very fast, simple and easy to understand and implement. "PSO also has very few parameters to adjust [Eberhart et al., 2001] and requires little memory for computation. Unlike other evolutionary and mathematical algorithms it is more computationally effective" [Rana et al., 2011] (citing [Arumugam et al., 2005]). Again according to [Rana et al., 2011], the disadvantages are the "poor quality results when it deals with large and complex data sets". "PSO gives good results and accuracy for single objective optimization, but for a multi objective problem it becomes stuck in local optima" [Rana et al., 2011] (citing [Li/Xiao, 2008]). Another problem with PSO is its tendency to reach fast and premature convergence at mid-optimum points [Rana et al., 2011]. It is difficult to find the correct stopping criterion for PSO [Bogon, 2013, p. 155], which is usually one of the following: a fixed maximum number of iterations, a maximum number of iterations without improvement or a minimum objective function error [Abraham et al., 2006; Esmin et al., 2015]. Hybrid PSO algorithms usually optimize an objective function [Bogon, 2013, pp. 39 ff, 46] and therefore always make implicit assumptions regarding the underlying structures of the data (see chapters 2, 4 and 5 for details). Notably, there is no single "best" criterion for obtaining a clustering because no precise and workable definition of "a cluster" exists [Jain/Dubes, 1988, p. 91]. For the task of dimensionality reduction, the swarm-inspired projection (SIP) method [Su et al., 2009] is discussed later in this section.
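The three stopping criteria named above can be illustrated with a generic global-best PSO. The sketch below is a textbook-style variant with an inertia weight, not one of the hybrid clustering algorithms cited in this section. The parameter values (`w`, `c1`, `c2`, `patience`, `min_error`) and the sphere objective used in the usage example are assumptions for illustration only.

```python
import random


def pso(objective, dim, n_particles=30, max_iter=500, patience=50,
        min_error=1e-8, w=0.7, c1=1.5, c2=1.5, bounds=(-5.0, 5.0), seed=0):
    """Minimize `objective`; returns (best position, best value, stop reason)."""
    rng = random.Random(seed)
    lo, hi = bounds
    x = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in x]                       # personal best positions
    pbest_val = [objective(p) for p in x]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]    # global best
    stagnation = 0
    for _ in range(max_iter):                       # criterion 1: fixed maximum
        improved = False
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (pbest[i][d] - x[i][d])
                           + c2 * r2 * (gbest[d] - x[i][d]))
                x[i][d] += v[i][d]
            val = objective(x[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = x[i][:], val
                if val < gbest_val:
                    gbest, gbest_val, improved = x[i][:], val, True
        stagnation = 0 if improved else stagnation + 1
        if gbest_val <= min_error:                  # criterion 3: minimum error
            return gbest, gbest_val, "min_error"
        if stagnation >= patience:                  # criterion 2: no improvement
            return gbest, gbest_val, "no_improvement"
    return gbest, gbest_val, "max_iter"


if __name__ == "__main__":
    sphere = lambda p: sum(t * t for t in p)
    best, value, reason = pso(sphere, dim=2)
    print(best, value, reason)
```

Which criterion fires first depends on the objective and on the chosen constants, which reflects the difficulty of selecting a stopping criterion noted above.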

ACO methods for clustering tasks are referred to as ABC methods (for an overview, see [Kaur/Rohil, 2015]). ABC methods model the behavior of ant colonies, and data points are picked up and dropped off accordingly [Bonabeau et al., 1999]. ABC was introduced by [Deneubourg et al., 1991] as a way to explain the phenomenon of the gathering and sorting of corpses observed among ants. In an experiment (Figure 7.3), the ants formed cemeteries of dead ants that had been randomly scattered beforehand. [Deneubourg et al., 1991] proposed probability functions for the picking up and dropping off of the corpses. Because ants are very specialized in their roles, several different types of ants of the same species exist in a colony, and different individuals in the colony perform different tasks. The probabilities are calculated as functions of the number of corpses of the same type in a nearby area (positive feedback).

<sup>42</sup> See also the results presented in chapter 9.

Figure 7.3: Randomly scattered ant corpses are clustered by living ants in a matter of hours [Bonabeau et al., 1999, p. 151; Martens et al., 2011, Fig.5]. The different stages depicted correspond to 0, 3, 6 and 36 hours after the beginning of the experiment.

For a clustering task, the ants and data points (representing ant corpses) are randomly placed on a lattice, and the ants move randomly across the lattice, at times picking up and carrying the data points [Lumer/Faieta, 1994]. The probabilities of picking up and dropping off the data points are modified according to a dissimilarity-based evaluation of the local density (see [Kaur/Rohil, 2015] and [Jafar/Sivakumar, 2010], citing [Lumer/Faieta, 1994]).
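As an illustration of such rules, the sketch below uses the forms in which the pick-up and drop-off probabilities and the dissimilarity-based local density are commonly cited in the ABC literature. The exact functional forms, the constants `k1`, `k2` and `alpha`, and the function names are assumptions here and are not taken from this chapter.

```python
def local_density(item, neighbors, dissimilarity, alpha, n_cells):
    """Dissimilarity-based density of `item` in its local lattice area.

    Each neighboring data point contributes 1 - d(item, neighbor) / alpha;
    the sum is scaled by the number of cells in the area and clipped at 0.
    """
    total = sum(1.0 - dissimilarity(item, other) / alpha for other in neighbors)
    return max(0.0, total / n_cells)


def p_pick(f, k1=0.1):
    """Probability of picking up a data point: high where the local density f is low."""
    return (k1 / (k1 + f)) ** 2


def p_drop(f, k2=0.15):
    """Probability of dropping a carried data point: high where f is high."""
    return (f / (k2 + f)) ** 2


if __name__ == "__main__":
    def absolute_difference(a, b):
        return abs(a - b)

    # A point surrounded by similar points is unlikely to be picked up ...
    f_similar = local_density(1.0, [1.0, 1.1, 0.9], absolute_difference,
                              alpha=1.0, n_cells=9)
    # ... whereas a point surrounded by dissimilar points is likely to be picked up.
    f_dissimilar = local_density(1.0, [5.0, 6.0, 7.0], absolute_difference,
                                 alpha=1.0, n_cells=9)
    print(p_pick(f_similar), p_pick(f_dissimilar))
    print(p_drop(f_similar), p_drop(f_dissimilar))
```

The rules depend only on the local neighborhood of a data point, which is why the clustering objective remains implicit in this family of methods.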

[Handl et al., 2006] enhanced the algorithm; they called their version Adaptive Time-dependent Transporter Ants (ATTA) because they incorporated adaptive heterogeneous ants and time-dependent transport activities into the algorithm. Further improvements to the picking up and dropping off activities were presented in [Omar et al., 2013; Ouadfel/Batouche, 2007], and improvements to the initialization and post-processing were proposed in [Aparna/Nair, 2014]. Another version of the approach was developed by introducing an annealing scheme [Tsai et al., 2004]. A feature of ABC algorithms is that the clustering objective is implicitly defined: neither the overall clustering objective nor the type of clusters sought is explicitly defined at any point during the clustering process<sup>43</sup> [Handl/Meyer, 2007].

The main problem in ABC lies in the fact that the picking up and dropping off behaviors are independent of the number of agents required to execute the task [Herrmann, 2011, p. 81; Herrmann/Ultsch, 2008a, 2008c, 2009; Tan et al., 2006]. Furthermore, ABC methods can be regarded as derived from the batch-SOM algorithm [Herrmann/Ultsch, 2008a]. From this perspective, an ABC algorithm possesses an objective function, which can be decomposed into an output density term multiplied by one minus a topographic quality term [Herrmann, 2011, pp. 137-138; Herrmann/Ultsch, 2008a, p. 3; 2008c, p. 217; 2009, p. 4]. Both terms are minimized simultaneously [Herrmann/Ultsch, 2008a, 2008c, 2009]. The output density term is easy to optimize but distorts the correct clustering of the data. Here, it is argued that at least 100 agents are required for self-organization in a swarm. However, this many agents are not required in ABC methods, and consequently, the self-organization property of ABC-based swarm algorithms is controversial.

Methods of the third type are founded on an analysis of the behavior of bees [Karaboga, 2005]. These are hybrid approaches to clustering that use swarm intelligence in combination with other methods, e.g., k-means<sup>44</sup> [Karaboga/Ozturk, 2011; Marinakis et al., 2007; Pham et al., 2007; Zou et al., 2010] or SOM [Fathian/Amiri, 2008].

To the best of the author's knowledge, only seven instances of the application of AI in projection methods exist. One method is based on foraging theory, which focuses on two basic problems: which prey a forager should consume and when a forager should leave a patch [Stephens/Krebs, 1986, p. 6]. A forager is viewed as an agent who compares a potential energy gain with a potential opportunity for finding an item of a superior type [Martens et al., 2011] (citing [Stephens/Krebs, 1986]). This approach is also called the prey model [Martens et al., 2011]: the average energy gain can be mathematically expressed in terms of the expected time, energy intake, encounter rate and attack probability for each type of prey. In the projection method proposed by [Giraldo et al., 2011], in addition to the characteristics of the approach described above, the "foraging landscape was viewed as a discrete space, and objects representing points from the dataset as prey." Three agents were defined as foragers. Here, the approaches based on the prey model are classified as basic swarm algorithms.

<sup>43</sup> This feature will be used in Databionic swarm.

<sup>44</sup> k-means is known to search for spherical clusters [Hennig et al., 2015, p. 721]/[Hennig, 2015a, p. 18]; see above.

A second method, called the self-organizing swarm (SOSwarm) method, is a clustering method based on a hybrid of PSO and SOM [O'Neill/Brabazon, 2008]. In SOSwarm, 100 particles were used on a 10x10 SOM feature map. However, because only a few units are used, SOSwarm represents a combination of k-means-SOM (see chapter 3) with PSO. Thus, it can be viewed as an application of swarm intelligence, but it is questionable whether this swarm is self-organizing because 4096 neurons are required for self-organization in SOMs [Ultsch, 1999] and the conditions for self-organizing swarm behavior may not apply [Bonabeau et al., 1999, pp. 22-25].

A third method is known as the swarm-inspired projection (SIP) [Su et al., 2009], as briefly mentioned above. SIP is a PSO approach that is loosely related to foraging theory because it is inspired by the foraging behavior of doves. The authors report that the number of doves should be significantly smaller than the number of data points and need only be higher than the expected number of clusters. Because of the small number of agents used, it is questionable whether this swarm is self-organizing, but as a PSO approach, it is an example of swarm intelligence.

The fourth approach, SOP [Herrmann, 2011], was already introduced. In terms of swarm behavior, the SOP algorithm does not consider collision avoidance (see the second section of this chapter), as seen from the fact that more than one DataBot may occupy the same position. After an annealing process, the SOP agents are uniformly distributed [Herrmann, 2011, pp. 68-69]; thus, the principle of flock centering is also disregarded. In the next chapter, it will be shown that the SOP algorithm also does not necessarily exhibit the property of *fluctuations* (referred to in the next section as *randomness*) because the position choices of the DataBots are predictable owing to their self-interaction and the oblique neighborhood definition. In summary, SOP is a self-organizing swarm of DataBots that applies Schelling's idea to unsupervised machine learning but cannot be regarded as an example of swarm intelligence.

Because ABC methods can be reduced to one ant, these approaches are classified as basic swarms. To exhibit swarm intelligence, a swarm must contain more than one independent agent. Therefore, LF [Lumer/Faieta, 1994] and its derivatives<sup>45</sup> ATTA-TM [Handl et al., 2006] and ASM [Xu et al., 2007] are not applications of swarm intelligence. Notably, the argument presented here is only valid for ABC methods of unsupervised learning; the categorization may prove invalid for other ACO methods that are supervised.

<sup>45</sup> The fifth, sixth and seventh applications of unsupervised learning.

The discussion presented in this section is summarized in Figure 7.4, in which only projection methods are explicitly listed. The various methods used for clustering cannot all be illustrated in one figure. Thus, only general hybrid types are depicted. For all of the publications mentioned above, there is currently no open-source code<sup>46</sup> available except for applications of rule-based classification [Martens et al., 2011].

Figure 7.4: Types of swarm algorithms used in unsupervised learning. Pswarm will be introduced in the next chapter; it combines self-organization with swarm intelligence. Various PSO and bee hybrids are used for clustering tasks. Most of these are based on k-means. Aside from Schelling's segregation model, only projection methods are explicitly listed. Abbreviations: ant-based clustering (ABC), particle swarm optimization (PSO).

<sup>46</sup> The authors of [O'Neill/Brabazon, 2008; Su et al., 2009; Giraldo et al., 2011] were contacted via email, but only Giraldo et al. responded and provided their source code. Due to various limitations, it could not be used for this thesis.

#### **7.3 Missing Links: Emergence and Game Theory**

Through self-organization, novel and irreducible<sup>47</sup> structures, patterns, and properties can emerge in a complex system [Goldstein, 1999]. In analogy to SOMs [Ultsch, 1999], this idiosyncratic behavior of a swarm is defined here as *emergence* (see also [Stephan, 1999]).

Sometimes, a distinction is made between strong and weak emergence [Janich/Duncker, 2011, p. 19]. Here, only strong emergence is relevant. In the literature, the existence of emergence is controversial<sup>48</sup>; it is possible that the concept is only required because the causal explanations for certain phenomena have not yet been found [Janich/Duncker, 2011, p. 23]. Figure 7.5 presents an example of emergence in swarms. The non-deterministic movement of fish is temporarily and structurally unpredictable and consists of many interactions among many agents. Nevertheless, this fish school forms a ball-like formation.

It appears that the concept of emergence has remained unused and rarely discussed in the literature on swarm intelligence, although it is a key concept in AI [Brooks, 1991]. Emergence is mentioned in the literature as a biological aspect of swarms [Garnier et al., 2007], in distributed AI for complex optimization problems [Bogon, 2013, p. 19], in the context of software systems [Bogon, 2013, p. 19] (citing [Timm, 2006]) and as emergent computation [Forest, 1990]. Contrary to Forest, who assumes that only cooperative behavior can lead to emergence [Forest, 1990, p. 8], this work shows that egoistic behavior of a swarm can lead to emergence as well (see chapter 8). With regard to swarms, emergence should be a key concept. The four factors leading to emergence in swarms are *randomness*, *unpredictability*, *multiple interactions among many agents* and *irreducibility*.


[Bonabeau et al., 1999, p. 23] agrees with [Ultsch, 1999, 2007] regarding the first factor: "*Randomness* is often crucial, since it enables the discovery of new solutions, and *fluctuations* can act as seeds from which structures nucleate and grow." Here, an algorithm is considered to have the property of *randomness* if it uses a source of random numbers in its calculations (nondeterminism) [Ultsch, 2007]. The power of randomness is evident in Schelling's segregation model (Fig 3.).

The second factor, *unpredictability* [Ultsch, 2007; O'Connor/Wong, 2015], is incompatible with the PSO approach, in which an objective function is optimized [Martens et al., 2011] and, therefore, predictable assumptions are implicitly made regarding the structures of data sets in the case of unsupervised machine learning (see chapter 4 for further details on projection methods). The third factor, multiple interactions among many agents, was identified by [Forest, 1990, pp. 1-2] for nonlinear systems. Although [Bonabeau et al., 1999] defines a requirement of multiple interactions for self-organization, the authors argue on page 24 that a single agent may also be sufficient. This is not the case for emergence, for which many elementary processes are mandatory [Beni, 2004; Ultsch, 1999]. Hence, ACO methods cannot exhibit the property of emergence. Nonlinearity means that adding or removing interactions among agents or any agents

<sup>47</sup> There is no way to derive the property from any part, subset or partial structure of the system [Ultsch, 2007].

<sup>48</sup> For applications, the existence of emergence is irrelevant. Even if emergent phenomena can be causally explained, they can still be used in the future (see [Stephan, 1999] for discussion).

Figure 7.5: A fish swarm in the form of a ball [Uber\_Pix, 2015]: an example of emergence in swarms. It illustrates the ability of a system to produce phenomena on a new, higher level.

themselves results in behavior that is linearly unpredictable. For example, the removal of one DataBot results in the elimination of one data point.

The fourth factor, *irreducibility* [Kim, 2006, p. 555; Ultsch, 2007; O'Connor/Wong, 2015], means that the (novel) property cannot be derived from any agent (or part) of the system, but is only a property of the whole system. It is the ability of a system to produce phenomena on a new, higher level [Ultsch, 1999]. Vividly, it marks a distinction between the self-organization in Figure 7.2, where essentially the pattern of a snowflake can be derived from the physical properties and chemical bonds of $H_2O$, and Figure 7.5, where the formation of a ball cannot be predicted from any single fish.

The second missing link is a connection to game theory, in which the four axioms of self-organization (*positive* and *negative feedback*, *amplification of fluctuations* and *multiple interactions*) are apparent. Game theory was introduced by [Neumann/Morgenstern] in 1947. The purpose of game theory is to model situations<sup>49</sup> in which multiple players interact with each other or affect each other's outcomes [Nisan et al., 2007, p. 3] (*multiple interactions*). Here, the focus lies on a general, not zero-sum, n-person game [Neumann/Morgenstern, 1953, p. 85]. A game is defined as a scenario with n players i=1, …, n in which each player makes a choice [Neumann/Morgenstern, 1953, p. 84] (*amplification of fluctuations*<sup>50</sup>).

Let a game *G* be defined by *n* players associated with *n* non-empty sets $\Pi_1, \ldots, \Pi_n$, where every set $\Pi_i$ represents all choices made by player $i$; then, the pay-off function is defined as

$$p = (p\_1, \ldots, p\_n) \colon \Pi\_1 \times \ldots \times \Pi\_n \to \mathbb{R}^n \tag{7.2}$$

<sup>49</sup> To be more specific, rational decision-making behavior in social conflict situations.

<sup>50</sup> Task switching.

The choices of each player determine the outcome for each player, and the outcome will, in general, be different for different players [Nisan et al., 2007, p. 9]. In a game, the payoff for each player depends on not only his own choices but also the choices of all other players [Nisan et al., 2007, p. 9] (*positive* and *negative feedback*). Often, the choices are defined based on a set of mixed strategies for each player. From the biological point of view, these mixed strategies may include the five main principles of collective behavior: *Homogeneity*, *Locality*, *Velocity Matching*, *Collision Avoidance*, and *Flock Centering* [Grosan et al., 2006].

In a game with *n* players, let the $k(i)$ choices of player $i$ be defined by a set $\Pi_i = \{\pi_1(i), \ldots, \pi_\alpha(i), \ldots, \pi_{k(i)}(i)\}$, where $\pi_\alpha(i)$ indicates the $i$-th player's $\alpha$-th choice; then, a mixed strategy $s_j(i) \in S_i$ for player $i$ is defined by

$$s_j(i) = \sum_{\alpha=1}^{k(i)} c_\alpha(i)\, \pi_\alpha(i) \tag{7.3}$$

where $\sum_{\alpha=1}^{k(i)} c_\alpha(i) = 1$ and all $c_\alpha(i) \ge 0$.

For noncooperative games, [Nash, 1951] proved the existence of at least one equilibrium point. Let $t_j(i) \in S_i$ be the mixed strategy that maximizes the payoff for player $i$; then, the Nash equilibrium is defined as

$$p_i(s(1), \dots, s(i-1), t_j(i), s(i+1), \dots, s(n)) = \max_{t_j(i) \in S_i} p_i(s(1), \dots, s(n)) \tag{7.4}$$

if and only if this equation holds for every $i$ [Nash, 1951]. The mixed strategy $t_j(i) \in S_i$ is the equilibrium point if no deviation in strategy by any single person results in a greater profit for that person. A Nash equilibrium is called *weak* if multiple mixed strategies $t_j(i) \in S_i$ for the same person exist in equation (7.4) that result in the same maximal payoff, whereas in a *strong* Nash equilibrium, even a coalition of players cannot further increase their payoffs by simultaneously changing their strategies $t_j(i) \in S_i$, $i = 1, \ldots, m \le n$, in (7.4). An illustrative example is the prisoner's dilemma [Poundstone, 1992]. Because of the interactions among the mixed strategies of all players that govern the payoff for a single player, the Nash equilibrium is not necessarily unique, and multiple different equilibria could exist.
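The equilibrium condition can be illustrated with a short Python sketch that enumerates pure strategies only, i.e., the special case of a mixed strategy in which one coefficient equals one. The payoff values for the prisoner's dilemma and the function name `pure_nash_equilibria` are illustrative assumptions.

```python
from itertools import product


def pure_nash_equilibria(payoffs, n_choices):
    """Return all pure-strategy profiles from which no single player gains by deviating.

    `payoffs[profile]` is a tuple holding one payoff per player;
    `n_choices[i]` is the number of choices of player i.
    """
    equilibria = []
    for profile in product(*(range(k) for k in n_choices)):
        stable = True
        for player, k in enumerate(n_choices):
            current = payoffs[profile][player]
            for alternative in range(k):
                deviated = profile[:player] + (alternative,) + profile[player + 1:]
                if payoffs[deviated][player] > current:
                    stable = False  # a unilateral deviation pays off
        if stable:
            equilibria.append(profile)
    return equilibria


# Prisoner's dilemma with assumed payoffs (negative years in prison).
# Choice 0 = stay silent, choice 1 = confess.
PRISONERS_DILEMMA = {
    (0, 0): (-1, -1),
    (0, 1): (-3, 0),
    (1, 0): (0, -3),
    (1, 1): (-2, -2),
}

if __name__ == "__main__":
    print(pure_nash_equilibria(PRISONERS_DILEMMA, (2, 2)))
```

With these payoffs, mutual confession is the only profile that survives the check, although both players would be better off staying silent. A game in which both players are rewarded for making the same choice yields two equilibria under the same function, which illustrates the non-uniqueness noted above.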


# **8 Databionic Swarm (DBS)**

This chapter introduces a new concept for the use of swarm intelligence. It makes use of insights from the previous chapter and proposes a projection method based on a swarm of intelligent agents called DataBots [Ultsch, 2000c]. This new swarm is called a polar swarm (Pswarm) because its agents move in polar coordinates based on symmetry considerations (see [Feynman et al., 2007, pp. 147-153, 745]). All parameters are automatically chosen according to, and directly based on, the appropriate high-dimensional definition of distance. The main idea of Pswarm is to combine the concepts of swarm intelligence and self-organization with non-cooperative game theory [Nash, 1950]. The main advance is the reliance on the concept of emergence [Ultsch, 2007] instead of the optimization of an objective function. This allows Pswarm to preserve structures in data sets that are characterized by discontinuity.

The extensive analysis of ant-based clustering (ABC) methods that has been performed in previous work allows the formulation of a precise mathematical definition of pheromonal stigmergy (a *scent*) [Herrmann/Ultsch, 2009]. The scent is defined in each neighborhood using an annealing scheme. The approach based on neighborhood reduction during the annealing process was invented by Kohonen [Kohonen, 1982b] and was used, for example, in [Demartines/Hérault, 1995; Hinton/Roweis, 2002; Ultsch, 1999]. In the context of swarm-based techniques, it was used for the first time in [Tsai et al., 2004]. Until now, finding the correct annealing scheme for a high-dimensional data set has remained a challenging task [Nybo et al., 2007]. The Pswarm algorithm utilizes randomness and the Nash equilibrium [Nash] of non-cooperative game theory to find an appropriate annealing scheme based on the data as given in the input space. For this purpose, the scent will be redefined as the payoff function<sup>51</sup>.

Having projected the high-dimensional points into two dimensions using Pswarm in section 8.1, the author applies the insights from chapters 4 and 5, particularly with regard to the generalized U-matrix, to propose a three-dimensional topographic map with hypsometric tints [Thrun et al., 2016a] based on the high-dimensional distances and the density of the two-dimensional projected points. Drawing further insights from [Lötsch/Ultsch, 2014], a semi-interactive, but parameter-insensitive, clustering approach is possible. The framework as a whole is called Databionic swarm (DBS) and has only two parameters: the number of clusters and the type of clustering (connected or compact). The key feature of DBS is that neither an overall objective function for the process nor the type of clusters sought is explicitly defined at any point during the Pswarm process. Both parameters can be deduced from a topographic map of the Pswarm projection and a dendrogram. For DBS clustering and Pswarm projection, the CRAN R package DatabionicSwarm was used [Thrun, 2017].

# **8.1 Projection with Pswarm**

This section introduces the Polar swarm (Pswarm) algorithm, which is the key foundation for the clustering performed in the DBS framework. Although the entire algorithm is used in an interactive clustering approach, Pswarm by itself may be used as a projection method. Because

M. C. Thrun, *Projection-Based Clustering through Self-Organization and Swarm Intelligence*, https://doi.org/10.1007/978-3-658-20540-9\_8

<sup>51</sup> However, DataBots will still be described as "smelling" their surroundings.

this enables direct comparison with the swarm-organized projection (SOP) algorithm, Pswarm is introduced and discussed separately from DBS.

The analysis presented in the second section of this chapter strongly indicates that Pswarm outperforms SOP in terms of structure preservation by virtue of the property of emergence arising from its self-organizing collective behavior (see also chapter 10, section 3). In contrast to SOP and all other common projection methods [Venna/Kaski, 2007; Venna et al., 2010], Pswarm does not require any input parameters other than the data set of interest, in which case Euclidean distances are used in the input space. Alternatively, a user may also provide Pswarm with a matrix defined in terms of a particular dissimilarity measure, which is typically a distance but may also be a non-metric measure.

#### *8.1.1 Motivation: Game Theory*

The purpose of game theory is to model situations in which multiple players interact with each other and/or affect each other's outcomes [Nisan et al., 2007, p. 3]. The author of this thesis focuses on a general, not zero-sum, non-cooperative game of n players [Neumann/Morgenstern, 1953, p. 85] in which the choices each player makes determine the outcome for each player [Nisan et al., 2007, p. 9]. For this kind of game, Nash proved the existence of at least one equilibrium point [Nash, 1951]. The payoff for each player depends on not only his own choices but also the choices made by all other players [Nisan et al., 2007, p. 9]. Often, these choices are defined based on a set of mixed strategies for each player.

The key idea of Pswarm is to redefine a game as one annealing step (epoch), the players as DataBots, and the scent as a payoff function and to find an equilibrium for each game. In the context of Pswarm, the game consists of rules governing the movement of the DataBots, which is defined by the grid, the neighborhoods and the payoff function. Each DataBot searches for its strongest payoff by either moving across the grid or staying in its current position. A new game (epoch), which is defined based on the considered neighborhood radius R, begins once an approximate equilibrium is achieved, i.e., once no movement of any DataBot leads to a stronger or better payoff for any other DataBot any longer (weak Nash equilibrium). This approach leads to a data-driven annealing scheme with steps that are not defined by parameters, in contrast to SOP (e.g., *threshold\_max*, *i\_max* in Listing 7.1), CCA and ESOM (e.g., number of epochs) as well as NeRV<sup>52</sup>.

#### *8.1.2 Symmetry Considerations*

If we consider DataBots that occupy space in two dimensions, such as spheres or atoms, two points must be considered: first, no two DataBots are allowed to be in the same spot at the same time (*collision avoidance*), and second, a hexagonal lattice (tiling) is the densest possible packing of identical spheres in two dimensions [Hunklinger, 2009, p. 65]. Every such sphere represents a possible position for a DataBot. To ensure that the two-dimensional output space is used most efficiently, a hexagonal lattice tiling (*grid*) is used in Pswarm. To avoid problems associated with the surface of the grid, such as the positioning of DataBots near the border, the grid must have periodic boundary conditions and consequently must possess full *translational symmetry* [Haug/Koch, 2004, p. 34]. If the third dimension (e.g., as in a crystal) is disregarded, this two-dimensional grid can be represented by a three-dimensional torus [Pasquier, 1987], which

<sup>52</sup> e.g. iterations, cg\_steps, cg\_steps\_final in [Nybo/ Venna, 2015, Thrun et al., 2017].

is hereafter referred to as a *toroidal grid*. This means that the borders of the grid are cyclically connected. The periodicity of the grid is defined by its size in terms of the numbers of lines *L* and columns *C*. If the grid were *planar* (not toroidal), undesired boundary effects could affect the outcome of any method.

Boundary effects are effects related to the borders of the output space in which the patterns of interactions across the borders of the bounded region are ignored or distorted, giving rise to shape effects, such that the shape imposed on the planar output space affects the perceived interactions between phenomena (see [McDonnell, 1995]). For example, if the output space is planar, it is unknown whether a projected point on the left border is similar (or dissimilar, in this case) to a projected point on the right border. It could be that the projection method is constrained to split similar points (with regard to the input space) in the output space. Another example is the distorted interactions between DataBots on the four borders when the output space is planar. Compared with a planar output space, a toroidal output space imposes fewer constraints on a projection (or clustering) method<sup>53</sup> and therefore enables a more optimized folding of the high-dimensional input space. A toroidal output space (in the case of Pswarm, a grid) possesses the advantage of translational symmetry in two dimensions, and in this case, the direction of a DataBot's movement is less important than its extent (length) because of the periodicity (of the grid).
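The difference between a planar and a toroidal output space can be made concrete with a small Python sketch. It assumes, for simplicity, a rectangular grid addressed by (line, column) indices rather than the hexagonal lattice used in Pswarm; the function names are hypothetical.

```python
import math


def planar_distance(p, q):
    """Euclidean distance between two grid positions on a bounded (planar) grid."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def toroidal_distance(p, q, lines, columns):
    """Shortest distance between p and q when the grid borders are cyclically connected."""
    d_line = abs(p[0] - q[0])
    d_col = abs(p[1] - q[1])
    d_line = min(d_line, lines - d_line)  # going "around" may be shorter
    d_col = min(d_col, columns - d_col)
    return math.hypot(d_line, d_col)


if __name__ == "__main__":
    left, right = (10, 0), (10, 79)
    print(planar_distance(left, right))            # far apart on a planar grid
    print(toroidal_distance(left, right, 50, 80))  # direct neighbours on the torus
```

Two positions on opposite borders are maximally separated in the planar case but adjacent on the torus, so a method working on the torus is not forced to split similar points merely because they lie near a border.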

In addition to the above considerations, the positions on the grid are coded using polar coordinates because the distances between DataBots on the grid will be most important in later computations of the neighborhoods and the annealing scheme. Consequently, based on the relevant *symmetry considerations,* a transformation of the Cartesian *(x, y)* coordinate system into polar coordinates $(r, \phi) \in O$ is proposed as follows:

$$r = \sqrt{x^2 + y^2} \qquad \text{(8.1)}$$

$$\phi = \tan^{-1}\left(\frac{y}{x}\right) \cdot \frac{180}{\pi} \qquad \text{(8.2)}$$

Hereafter, *r* represents the length of a DataBot's movement (*jump*), and $\phi$ represents the direction of that movement.
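A minimal Python sketch of this coordinate transformation follows. It takes *r* to be the Euclidean length of the jump and uses `math.atan2` instead of a plain inverse tangent of *y/x*, which is a deliberate substitution: it resolves the quadrant and avoids a division by zero for *x* = 0. The function names are hypothetical.

```python
import math


def to_polar(x, y):
    """Convert Cartesian (x, y) into (r, phi) with phi in degrees."""
    r = math.hypot(x, y)                  # length of the jump
    phi = math.degrees(math.atan2(y, x))  # direction of the jump
    return r, phi


def to_cartesian(r, phi):
    """Inverse transformation, with phi given in degrees."""
    rad = math.radians(phi)
    return r * math.cos(rad), r * math.sin(rad)


if __name__ == "__main__":
    r, phi = to_polar(3.0, 4.0)
    print(r, phi)
    print(to_cartesian(r, phi))
```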

Previously, the size of any grid (e.g., in SOP or emergent self-organizing map (ESOM)), as defined by the numbers of lines *L* and columns *C*, had to be chosen by the user. Choosing an incorrect size could result in a poor projection of the data. This was noted in previous works describing DataBot approaches prior to the development of the SOP algorithm [Kohlhof, 2010]. By contrast, in Pswarm, the grid size is chosen automatically, subject to three conditions. Let $\tilde{D}$ be the upper triangle of the matrix of the input distances, let *N* be the number of DataBots, let $\alpha$ be the number of possible jump positions, let $\beta \in (0.5, 1]$ be a scaling factor, and let $p_{99}$ and $p_{01}$ denote the 99th and first percentiles, respectively, of the distances; then, the conditions for determining the grid size are

$$\sqrt{C^2 + L^2} \ge \frac{p_{99}(\tilde{D})}{p_{01}(\tilde{D})} =: A \qquad\qquad \text{(I)}$$

$$L \cdot C \ge \alpha \cdot N \qquad\qquad \text{(II)}$$

<sup>53</sup> To the author's knowledge, only the emergent self-organizing map (ESOM) and the swarm-organized projection (SOP) method offer the option to switch between planar and toroidal spaces (see [Ultsch, 1999], [Herrmann, 2011, p. 98]).

$$\frac{L}{C} = \frac{\beta}{1} \qquad \qquad \qquad \text{(III)}$$

These conditions result in the following bi-quadratic equation:

$$C^4 - A^2 C^2 + \alpha^2 N^2 = 0 \tag{8.3}$$

Substituting $z = C^2$ gives

$$z_{1/2} = \frac{A^2}{2} \pm \frac{1}{2} \sqrt{A^4 - 4\alpha^2 N^2}$$

$$\Rightarrow C = \begin{cases} \frac{1}{\sqrt{2}} \sqrt{A^2 + \sqrt{A^4 - 4\alpha^2 N^2}}, & A^4 \ge 4\alpha^2 N^2 \\ \text{approximation}, & A^4 < 4\alpha^2 N^2 \end{cases} \tag{8.4}$$

The first condition ensures that the shortest and longest distances of interest are assignable to grid units. It defines the possible resolution of high-dimensional structures in the grid. The second condition ensures that there are sufficient available positions to which a DataBot can jump. The third condition causes the grid to be more rectangular than square because in the case of SOMs, "rectangular maps outperform square maps" [Ultsch/Herrmann, 2005]. The first two conditions are used to formulate the bi-quadratic equation under the assumption of equality (see Eq. 8.4). If the equation has no real solution, i.e., in the case of $A^4 < 4\alpha^2 N^2$, then conditions I and III are used to generate approximate solutions. The scaling factor $\beta$ is arbitrary and is used only to ensure a solution in the case of approximation; it is not a parameter that has to be chosen. In this solution space, a solution that fulfills condition II is chosen.
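Under the equality assumption, conditions I and II read $C^2 + L^2 = A^2$ and $L \cdot C = \alpha N$. The Python sketch below solves this pair directly by the substitution $z = C^2$ and checks whether a real solution exists. The function names, the use of `statistics.quantiles` with the inclusive method for the percentiles, and the omission of rounding to integer grid sizes are assumptions of this sketch; the approximation branch via conditions I and III is not implemented.

```python
import math
import statistics


def distance_ratio(distances):
    """A = p99 / p01 of the pairwise input distances (upper triangle, flattened)."""
    cuts = statistics.quantiles(distances, n=100, method="inclusive")
    p01, p99 = cuts[0], cuts[98]
    return p99 / p01


def grid_size(a_ratio, n_databots, alpha=4):
    """Solve C^2 + L^2 = A^2 and L * C = alpha * N for (lines, columns).

    Returns real-valued (L, C) with C >= L, or None if no real solution exists.
    """
    product = alpha * n_databots
    discriminant = a_ratio ** 4 - 4 * product ** 2
    if discriminant < 0:
        return None  # the text falls back to an approximation in this case
    columns = math.sqrt((a_ratio ** 2 + math.sqrt(discriminant)) / 2)
    lines = product / columns
    return lines, columns


if __name__ == "__main__":
    lines, columns = grid_size(50.0, 100)
    print(lines, columns)
    print(lines * columns, lines ** 2 + columns ** 2)  # reproduces alpha*N and A^2
```

Substituting the result back into both conditions, as in the last line, is a quick way to confirm that the chosen root satisfies them.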

#### *8.1.3 Algorithm*

Several previously developed ideas are applied in Pswarm: scent<sup>54</sup> [Herrmann/Ultsch, 2008a], DataBots [Ultsch, 2000c] and the decreasing neighborhood radius proposed for DataBots by [Kämpf/Ultsch, 2006]. The decrease in the radius is based on the data and is not predefined by parameters, which was a goal of [Herrmann, 2011], where it was called self-adaptation. The underlying idea of the decreasing radius approach is to promote self-organization, first of a global structure and then of local structures [Kämpf/Ultsch, 2006].

The intelligent agents of Pswarm operate on a toroidal grid where the positions are coded using polar coordinates, $i_\phi(r) \in O$. This permits the DataBots' movement, the neighborhood function and the annealing scheme to be precisely defined. The numeric vector $z_j$ associated with each DataBot $b_j$ represents its distances from all other DataBots in the input space *I*. The output-space distances are coded using only the polar coordinate *r*. The size of the squared-distance matrix *D* is defined by the number of DataBots.

After the assignment of initial random positions on the grid *O* (and therefore random output distances) to the DataBots in Listing 8.1, a data-driven decrease of the radius *R* begins. In every iteration, a portion of the DataBots are allowed to jump if the payoff in one of their new positions is better (stronger) than that in their old positions. In other words, each DataBot is given a chance *c(R)* to try new positions on the grid.

The chance $c(R)\colon \mathbb{N} \to [0.05, 0.5]$ is a continuous, monotonically decreasing linear function that determines the number of DataBots that are allowed to search for a new position to jump

<sup>54</sup> Called topographic stress in [Herrmann/Ultsch, 2008].

to. Initially, many<sup>55</sup> DataBots are allowed to jump simultaneously to reproduce the coarse structure of the high-dimensional data set. However, as the algorithm progresses to address finer structures, only a small number<sup>56</sup> of DataBots may move simultaneously. The chance function depends on the number of DataBots and on the current radius $R$ and consequently is based on the data itself.
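One way to picture the chance function is a linear interpolation between one half of the DataBots at the largest radius and five percent at the smallest radius. The exact functional form is not given in the text, so the interpolation over the radius, the default `r_min = 1` and the function names in the following Python sketch are assumptions.

```python
import random


def chance(radius, r_max, r_min=1, c_start=0.5, c_end=0.05):
    """Fraction of DataBots allowed to search for a new position at `radius`."""
    if r_max == r_min:
        return c_end
    progress = (r_max - radius) / (r_max - r_min)  # 0 in the first epoch, 1 in the last
    return c_start + progress * (c_end - c_start)


def sample_jumpers(databots, radius, r_max, rng, r_min=1):
    """Draw the DataBots that may try to jump in the current iteration."""
    count = max(1, round(chance(radius, r_max, r_min) * len(databots)))
    return rng.sample(databots, count)


if __name__ == "__main__":
    rng = random.Random(42)
    bots = list(range(100))
    for radius in (25, 13, 1):
        print(radius, len(sample_jumpers(bots, radius, 25, rng)))
```

With 100 DataBots and a largest radius of 25, half of the swarm is sampled in the first epoch and five DataBots in the last one, in line with the bounds stated above.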

In Pswarm, the length of a possible DataBot jump is not reduced during annealing<sup>57</sup>. The possible jumps of DataBots to new positions are drawn from a uniform distribution; therefore, the probability of selection is the same for all possible jumps, from a jump to zero to a jump to *Rmax* in any direction. The direction of a jump to a new position is chosen separately from among all positions corresponding to an equal jump length. This approach prevents local minima from causing the DataBots to become stuck in an incorrect cluster because the length of their jump is smaller than half of the cluster's diameter. No DataBot is allowed to jump to an occupied position. Each DataBot may choose one of the four best different positions ($\alpha = 4$) in different directions to which to jump if it is sampled for jumping. This approach ensures a high probability that every sampled DataBot will find a free position.
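The jump proposal can be sketched in Python as follows. The sketch draws the jump length and the direction separately and rejects occupied positions. It assumes a rectangular toroidal grid with (line, column) indices instead of the hexagonal lattice, a continuous random angle instead of an enumeration of all positions at equal jump length, and an arbitrary cap on the number of attempts; the function name `propose_jumps` is hypothetical.

```python
import math
import random


def propose_jumps(position, occupied, r_max, lines, columns, rng, alpha=4):
    """Draw up to `alpha` distinct free candidate positions for one DataBot."""
    candidates = []
    attempts = 0
    while len(candidates) < alpha and attempts < 100 * alpha:
        attempts += 1
        length = rng.randint(1, r_max)           # jump length, uniformly distributed
        angle = rng.uniform(0.0, 2.0 * math.pi)  # direction, chosen separately
        line = (position[0] + round(length * math.sin(angle))) % lines
        column = (position[1] + round(length * math.cos(angle))) % columns
        target = (line, column)                  # wrapped around the torus
        if target != position and target not in occupied and target not in candidates:
            candidates.append(target)
    return candidates


if __name__ == "__main__":
    rng = random.Random(7)
    taken = {(0, 1), (1, 0), (9, 11)}
    print(propose_jumps((0, 0), taken, r_max=5, lines=10, columns=12, rng=rng))
```

The DataBot would then compare the payoff at these candidates with the payoff at its current position and keep the best one.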

*function Positions O = Pswarm(matrix D(l, j))*

 *for all* $z_i \in I$*: assign an initial random polar position* $i_\phi(r) \in O$ *on the grid to generate N DataBots* $b_i \in B$

 *for R = {Rmax = Lines/2, …, Rmin} do*

  *calculate chance c(R)*

  *repeat for each iteration*

   $c = \mathrm{sample}(c(R), B)$

   $m_k(c) = \mathrm{uniform}(1, R_{max})$*, with k = 1, …,* $\alpha$*,* $m_k(c) \in O$

   $l(c) = \operatorname{argmax}_{i \in \{i,\, m_k(c)\}} \bigl(\lambda(b_i, R)\bigr)$

   $l(!c) = i$

   $S = \sum_{i=1}^{N} \lambda(b_i, R)$

  *until* $\frac{\partial S(e, \lambda(R))}{\partial e} = 0$

 *return O in Cartesian coordinates*

*end function Pswarm*

Listing 8.1: The Pswarm algorithm consisting of *N* DataBots. New possible positions are denoted by $m_k(c)$, where *k* runs up to the number $\alpha$ of polar positions $i_\phi(r)$ chosen with an equal chance in the range from 1 up to *Rmax* (*uniform*) relative to the old position *i* of a DataBot that has a chance $c$ to jump. After the decision to jump or not to jump, the position is denoted by *l(c)*. All other DataBots, denoted by $!c$, do not search for a new position and remain at their old position *i*. The data-driven annealing scheme (repeat/until) is parameter-free due to the application of the Nash equilibrium of game theory (see 8.1.6).

<sup>55</sup> However, no more than half of the DataBots are allowed to search for a new position.

<sup>56</sup> At the end exactly five percent of all DataBots.

<sup>57</sup> Unlike in the SOP algorithm.

#### *8.1.4 Data-driven Annealing Scheme*

Let each annealing step be defined as an epoch *e*; then, a new epoch begins (and a game<sup>58</sup> ends) if the radius *R* is reduced by the condition defined below.

Let ݎሺ݆, ݈ሻ be the one-dimensional distance from l ∈ *O* to j ∈ *O* in polar coordinates ሺݎ,߶ ሻ as specified by the radius ܴ; then, the neighborhood function "Cone" is defined as

$$h_R \colon R_e \to [0, 1] \colon$$

$$h_R = \begin{cases} 1 - \frac{r(j, l)^2}{n R_e^2}, & \text{iff } \frac{r(j, l)^2}{n R_e^2} < 1 \\ 0, & \text{otherwise} \end{cases} \tag{8.5}$$

where $R_e$ is the radius of the neighborhood during epoch *e*.
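As a small illustration of Eq. 8.5, the Cone function can be written in a few lines. The sketch below is not taken from the author's software; the function name `cone` is hypothetical, and the constant that appears as *n* in Eq. 8.5 is passed in explicitly rather than fixed to a value.

```python
def cone(r, radius, n):
    """Cone neighborhood weight in [0, 1] for a radial distance r (Eq. 8.5).

    `radius` is the neighborhood radius R_e of the current epoch and `n`
    is the constant in the denominator of Eq. 8.5 (left as a parameter).
    """
    ratio = (r * r) / (n * radius * radius)
    # sharp border: the weight is exactly zero once the ratio reaches 1
    return 1.0 - ratio if ratio < 1.0 else 0.0
```

In contrast to a Gaussian, this function has a sharp border: every DataBot beyond that border contributes nothing to the scent.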

Let *D(l, j)* be the distance between $x_l, x_j \in I$, and let $r(j, l)$ be the one-dimensional radial distance in two-dimensional polar coordinates $(r, \varphi)$ in the output space O; then, in Pswarm, the scent around a DataBot $b_j$ is redefined to

$$\lambda_e(b_j, R_e, S_0) = \begin{cases} S_0 - \frac{\sum_{l \in I} h_R(r(j, l)) \cdot D(j, l)}{\sum_{l \in I} h_R(r(j, l))}, & \text{iff } \sum_{l \in I} h_R(r(j, l)) > 0 \\ S_0, & \text{otherwise} \end{cases} \tag{8.6}$$

where

$$S_0 = \sum_{j} |\lambda(b_j, R_{max}, 0)| \tag{8.7}$$

Following the discussion in section 8.1.2, the scent $\lambda(b_j, R)$ is identified as the payoff function $\lambda(b_j, R) \colon \mathbb{R}^+ \times O \to \mathbb{R}^+$ for a DataBot.
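The payoff of Eqs. 8.6 and 8.7 can be sketched as follows. The code is an illustration under stated assumptions, not the author's implementation: `radial` and `dist` are hypothetical precomputed square matrices of output-space radial distances r(j, l) and input-space distances D(j, l), the constant *n* of Eq. 8.5 is passed in, and the sums skip l = j, which is an assumption made here so that the "otherwise" branch can actually occur.

```python
def cone(r, radius, n):
    """Cone neighborhood of Eq. 8.5 (n is the constant in its denominator)."""
    ratio = (r * r) / (n * radius * radius)
    return 1.0 - ratio if ratio < 1.0 else 0.0


def scent(j, radial, dist, radius, s0, n):
    """Payoff of DataBot j following Eq. 8.6 (sums exclude l == j by assumption)."""
    others = [l for l in range(len(dist)) if l != j]
    weights = [cone(radial[j][l], radius, n) for l in others]
    total = sum(weights)
    if total <= 0.0:
        return s0  # no other DataBot inside the neighborhood
    weighted = sum(w * dist[j][l] for w, l in zip(weights, others))
    return s0 - weighted / total


def initial_offset(radial, dist, r_max, n):
    """S_0 of Eq. 8.7: sum of absolute payoffs at R_max with S_0 set to 0."""
    return sum(abs(scent(j, radial, dist, r_max, 0.0, n)) for j in range(len(dist)))
```

Since `dist` is computed once before the algorithm starts, each evaluation only combines stored distances with Cone weights, which is in line with the cost argument of this section.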

The high-dimensional input distances *D(l, j)* must be calculated only once, which is done prior to starting the algorithm, thereby reducing the computational cost. The computational cost of the algorithm does not depend on the dimension of the data set but does depend on the number of DataBots and the number of possible jump positions $\alpha$. Additionally, Pswarm allows the conversion of distances or dissimilarities into two-dimensional points.

Let *e* be the current epoch, let $R_e$ be the current neighborhood radius, and let $b_j \in B$ denote the DataBots; then, the sum of all payoffs is the current global happiness, which may be called the stress<sup>59</sup> $S(e, R_e)$, and is defined as

$$S(e, R_e) = \sum_{j} \lambda_e(b_j, R_e) \tag{8.8}$$

The neighborhood is reduced if the derivative of the current global happiness is equal to zero:

$$\frac{\partial S(e, R_e)}{\partial e} = 0 \tag{8.9}$$

which is called the *equilibrium of happiness* condition. The neighborhood radius R is reduced from *Rmax* toward *Rmin* with a step size of 1 if the derivative of the sum of all payoffs $\lambda$ is equal to zero. This is the case if a (weak) equilibrium for all DataBots is found.
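The annealing loop controlled by the equilibrium of happiness can be sketched as below. This is a hedged illustration: `step` and `stress` are hypothetical callables that perform one iteration and return S(e, R_e), the derivative condition is approximated by an unchanged stress between two consecutive iterations, and `tol` and `max_iter` are safety guards of this sketch only, not parameters of Pswarm.

```python
def anneal(r_max, r_min, step, stress, tol=1e-9, max_iter=10_000):
    """Reduce the radius from r_max to r_min in steps of 1.

    A radius is kept until the global happiness stops changing, which is
    used here as a discrete stand-in for dS/de = 0 (Eq. 8.9).
    """
    history = []
    for radius in range(r_max, r_min - 1, -1):
        previous = stress(radius)
        for _ in range(max_iter):
            step(radius)                 # sampled DataBots may jump
            current = stress(radius)
            if abs(current - previous) <= tol:
                break                    # equilibrium of happiness reached
            previous = current
        history.append((radius, current))
    return history
```

The number of iterations per radius is therefore decided by the data through the stress, not by a preset schedule.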

Because not all DataBots are allowed to jump simultaneously during a single iteration, as imposed by the function $sample(c(R), B)$, the DataBots are able to pay off their neighborhoods more often, thereby promoting the process of self-organization. By searching for an equilibrium, the net number of DataBots that would like to jump or are unhappy is irrelevant to the self-adaptive annealing process. Instead, the decision to shrink the neighborhood size or to proceed to the next epoch *e* is made based on a Nash equilibrium [Nash, 1950]. The criterion is clearly defined to correspond to the condition in which the global amount of happiness in the current epoch remains unchanged, which is defined as the *equilibrium of happiness*, $\frac{\partial S}{\partial e} = 0$.

<sup>58</sup> In the context of game theory.

<sup>59</sup> To simplify the comparison with SOP.

#### *8.1.5 Annealing Interval*

Rmax is equal to Lines/2 if Lines<Columns to prevent self-interaction of the DataBots. If the radius R were to be greater than Lines/2, then the neighborhood of a given DataBot would overlap with itself because of the toroidal nature of the grid. Moreover, the probability density function for choosing a new position cannot be uniformly (or Gaussian) distributed in this case because border positions can be reached from two directions $\phi$ on a toroidal grid.

Rmin is determined by the size of the grid and the number of DataBots. It is set to a value that allows every DataBot to smell a minimum of 5% of the other DataBots if they are distributed uniformly<sup>60</sup>. This selection is inspired by an emergent phenomenon called an ant mill [Schneirla, 1971, pp. 281-283]: Army ants are an aggressive, nomadic species, incessantly moving around. Based on its payoff, every ant follows another ant in front of it. If the head of the ant colony runs into the tail of the colony, the ants form a so-called circle of death, because they keep moving until they die. This phenomenon would not occur if the ants were able to smell a region farther ahead of them.

#### *8.1.6 Convergence*

In game theory, for a game with egoistic agents, a solution concept exists called the Nash equilibrium [Nash, 1950].

Let $(P, \Lambda)$ be a game with *N* DataBots $b_i$, $i = 1, \ldots, N$, where *P* is a set of movement strategies and $\Lambda = \{\lambda_{e,i}(b_i, R = const) \mid i = 1, \ldots, N\}$ is the payoff function evaluated for every grid position $w \in P$. Each DataBot chooses a movement strategy consisting of a probability associated with a position on the grid. Upon deciding on a position, a DataBot receives the payoff defined by the scent. *P* is a set of mixed strategies that are chosen stochastically with fixed probability in the context of game theory. Nash proved that in this case, the following equilibrium exists:

$$\forall i,\; w_i, b_i \in P \colon \lambda_i(b_i') \ge \lambda_i(b_i) \tag{8.10}$$

The strategy $b_i'$ is the equilibrium, for which no deviation in strategy (position on the grid) by any single DataBot results in a greater profit for that DataBot. In the case of Pswarm, the Nash equilibrium is called weak because there may be more than one strategy with the same payoff for some DataBots. Because of the existence of this equilibrium, the Pswarm algorithm will always converge.

<sup>60</sup> Rmin (and Rmax) are chosen automatically by the Pswarm algorithm based on the grid size and consequently based on the data.
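The equilibrium condition can be illustrated with a brute-force check. The sketch below assumes pure strategies (one grid position per DataBot) and a hypothetical `payoff(i, positions)` callable standing in for the scent; it tests whether any single DataBot could obtain a strictly greater payoff by moving alone to a free position. Ties are not counted as improvements, which corresponds to the weak equilibrium described above.

```python
def is_nash_equilibrium(positions, free_positions, payoff):
    """True if no single DataBot gains by deviating alone.

    positions:      list with the current position of every DataBot
    free_positions: iterable of unoccupied positions
    payoff:         hypothetical callable payoff(i, positions) -> float
    """
    for i in range(len(positions)):
        base = payoff(i, positions)
        for target in free_positions:
            # only DataBot i deviates; all others keep their positions
            trial = positions[:i] + [target] + positions[i + 1:]
            if payoff(i, trial) > base:
                return False
    return True
```

Such an exhaustive check is only practical for toy examples; Pswarm itself relies on the stress condition of Eq. 8.9 instead of enumerating deviations.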

### **8.2 Comparing Pswarm with a Previously Developed Approach**

Although the entire algorithm is used in an interactive clustering approach that does not require any sensitive input parameters, in this section, Pswarm is treated as an independent projection method and is compared with swarm-organized projection (SOP, see also chapter 10, section 3).

It will be demonstrated that changing the coordinate system from Cartesian to polar coordinates enables precise and practical definitions of neighborhoods, stigmergy and distances in the output space. With this approach, by using the Nash equilibrium [Nash, 1950] and modifying the DataBots' movements, it is possible to deduce a parameter-free and data-driven annealing scheme. This section will show that the self-adaptive annealing scheme of SOP requires important parameters and is, in fact, not always self-adaptive, as opposed to the Pswarm algorithm.

#### *8.2.1 Neighborhood Definition*

The main problem with regard to SOP lies in the neighborhood definition and annealing scheme of [Ultsch/Herrmann, 2010] and [Herrmann, 2011], as shown in Figure 8.1.

Because the lattice tiling is rectangular (quad grid), as is justified for Cartesian coordinates by [Ultsch/Herrmann, 2005], the neighborhoods are square and not round; this was explicitly defined in [Herrmann, 2011, p. 46] and remains unchanged in the SOP algorithm [Herrmann, 2011, pp. 64-70], and it is relevant to the scent $\lambda$ (as defined in chapter 7.1 in Eq. 7.1).

In SOP, $d_1(l, j) = d_2(l, j)$ applies, where these distances denote the lengths of jumps between *l, j=x, y* in Cartesian coordinates. This means that the probability of selecting a diagonal position for a DataBot jump is equal to that of selecting a horizontal/vertical position in the SOP lattice because the two-dimensional Gaussian neighborhood consists of two Gaussian functions, from which the vertical and horizontal coordinates are drawn separately to determine the chosen lattice positions: $N(m(x), s = \sigma = R)$ and $N(m(y), s = \sigma = R)$.

For the choice of new positions for the DataBots, Herrmann proposed that the selection probability should be a Gaussian [Herrmann, 2011, p. 64], where the center is the current position of the DataBot, m(x, y), and the standard deviation s [Ultsch/Herrmann, 2010, p. 3] is equal to the radius R. In [Ultsch/Herrmann, 2010], a two-dimensional Gaussian distribution $N_2(m, s)$ was mentioned, but a practical solution to the problem of how to implement a two-dimensional Gaussian distribution on a discrete lattice was not addressed [Herrmann, 2011]. Moreover, the neighborhood considered in [Herrmann, 2011, p. 64] was defined only on a finite lattice.

Figure 8.1: Neighborhood definition in the (rectangular) lattice tiling of a square shape of the SOP algorithm, adapted from [Herrmann, 2011, p. 47]. All positions defined at distances of less than or equal to r=2 are shown. Independent of the coordinate system, the SOP lattice is rectangular, with a size of (*L, C*).

Figure 8.2: A similar rectangular lattice tiling of a square shape in polar coordinates for comparison<sup>61</sup>. In Pswarm, $d_1(l, j) \neq d_2(l, j)$ applies for $j, l = r, \phi$ in polar coordinates. All positions at distances smaller than or equal to r=2 are marked by gray squares. In this case, the neighborhood (Eq. 8.5) depends on a precise one-dimensional grid distance, and for Gaussian neighborhoods, jump positions can be drawn from $N(m(r), s = R)$. Independent of the coordinate system, the Pswarm (hexagonal) grid has a rectangular shape of borders, with a size of (*L, C*).

On a toroidal grid or lattice (tiling), such a neighborhood will always overlap itself because Gaussian functions are never equal to zero. No solution for the case of a toroidal lattice was offered in [Herrmann, 2011]. Instead, in practice, the choice of a new DataBot position in the SOP algorithm is made by drawing separately from one normal distribution for the x coordinate and another normal distribution for the y coordinate, where the means are the corresponding coordinates of the current position and the standard deviations are equal to the radius<sup>62</sup> *R*. However, the following inequality applies:

<sup>61</sup> In reality, Pswarm uses a hexagonal tiling instead of a rectangular tiling referenced as a grid.

$$N(m(\mathbf{x}), \mathbf{s}) + N(m(\mathbf{y}), \mathbf{s}) \neq N^2(m(\mathbf{x}, \mathbf{y}), \mathbf{s})\tag{8.11}$$

Consequently, diagonal jumps are equal in length to horizontal and vertical jumps. However, [Bauer et al., 1999] argues that in a rectangular lattice, diagonal neighbors cannot be regarded as nearest neighbors. Moreover, the Gaussians overlap at the origin.

Based on *symmetry considerations,* a transformation from the Cartesian (x, y) coordinate system to the polar $(r, \phi)$ coordinate system is exploited in Pswarm.

This allows Pswarm to use a more precise neighborhood definition with sharp borders in Eq. 8.5, as illustrated in Figure 8.2, and makes the calculation of Euclidean distances in the two-dimensional output space unnecessary<sup>63</sup>. The neighborhood is defined only by the radius *r* of the polar coordinates. If the radius exceeds the borders of the toroidal grid, then the distance and jump length can be adapted using a modulus operation if drawn from uniform distributions. Allowing the maximum possible jump lengths prevents the algorithm from becoming trapped in local minima: if the jump length is too short, there is a possibility that the DataBots may be unhappy in their positions but unable to find new positions because no open positions exist.
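The modulus operation mentioned above can be illustrated for a single grid axis. This is a simplified sketch: the function names are hypothetical, and one Cartesian axis with `size` cells is used for readability, whereas Pswarm works with polar radii.

```python
def wrap(position, jump, size):
    """Land on the toroidal axis after a jump that may cross the border."""
    return (position + jump) % size


def toroidal_offset(a, b, size):
    """Shortest signed offset from cell a to cell b on a ring of `size` cells."""
    d = (b - a) % size
    # an offset longer than half the ring is shorter in the other direction
    return d - size if d > size / 2 else d
```

The shortest offset never exceeds `size / 2` in absolute value, which mirrors the choice *Rmax* = Lines/2: a larger radius would reach the same cell from two directions.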

In contrast to Pswarm, in SOP, the neighborhood definition for the scent $\lambda$ remains vague. In [Herrmann, 2011, p. 63], it is stated that the development of SOP led to the revision of the ABC method based on Figure 8.1, where quadratic neighborhoods are explicitly defined [Herrmann, 2011, p. 46]. Still, this definition remained unchanged [Herrmann, 2011, pp. 64-70]. However, if the maximal radius is set to R>Lines/2 for Lines<Columns, then the Gaussian function $F_R$ required to calculate the scent $\lambda$ [Herrmann, 2011, p. 64] overlaps itself if no sharp borders are defined or if the grid or lattice is not finite (see chapter 7.1 Eq. 7.1). This overlap changes the weights of the output-space distances and the probabilities of choosing new positions to which to jump.

Additionally, the neighborhood of the lattice in which the DataBot is moving is defined by equal (square) diagonal and vertical jumps, but the two-dimensional distances on the lattice are defined as Euclidean distances (radial). These definitions are inconsistent with each other. Thus, the annealing scheme of the SOP algorithm is more square (jump length, position probability) than radial (output-space Euclidean distance). In summary, the use of Gaussian functions prevents the possibility of precisely defining the DataBot jump length and neighborhood, and worse, the jump length and neighborhood are not consistent with the output distances; see Figure 8.1.
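The inconsistency between square jump lengths and radial output-space distances can be made concrete with a small count. The sketch below is an illustration only; it assumes that the square neighborhood of Figure 8.1 corresponds to the maximum of the absolute coordinate offsets and compares it with the Euclidean distance on a plain (non-toroidal) lattice. All function names are hypothetical.

```python
import math


def square_distance(dx, dy):
    """Jump length in a square neighborhood: diagonal steps count like straight ones."""
    return max(abs(dx), abs(dy))


def radial_distance(dx, dy):
    """Euclidean output-space distance used for the scent."""
    return math.hypot(dx, dy)


def neighborhood(distance, r):
    """All integer offsets whose distance to the center is at most r."""
    span = range(-r, r + 1)
    return [(dx, dy) for dx in span for dy in span if distance(dx, dy) <= r]
```

For r = 2, the square definition contains 25 lattice positions, whereas the radial definition contains only 13, so the two definitions disagree on which DataBots are neighbors.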

More importantly, the radius *R* does not define a border for the SOP neighborhood; instead, it defines only the standard deviation of the density of a normal distribution. This results in very large neighborhoods without sharp borders. The adaptation of this neighborhood definition for a toroidal lattice was not addressed, and if the definitions of [Herrmann, 2011] were to be used on a toroidal lattice without modification, this would lead to significant mistakes.

Consequently, the definition of the scent $\lambda$ is not consistent because the Euclidean output-space distance definition is inconsistent with the neighborhood definition.

<sup>62</sup> Taken from [Kohlhof, 2010] and Lutz Herrmann's 2011 Java implementation.

<sup>63</sup> A spherical coordinate system is the appropriate extension for a three-dimensional system.

Only a polar coordinate approach, such as that used in Pswarm, allows the selection of a neighborhood function $h_R$ that precisely defines the neighborhood borders (Eq. 8.5). Moreover, the computational effort needed to calculate the output-space distances from one DataBot to all others is reduced in such an approach because it is sufficient to look up radii coded in hash tables.

#### *8.2.2 Annealing Scheme*

The second problem with the SOP algorithm lies in the annealing scheme itself, which is not self-adaptive, as is claimed in [Herrmann, 2011]. This is because it is governed by two magic numbers: a threshold in terms of the number of DataBots that are allowed to jump and the maximum number of iterations after which an epoch ends given that this arbitrary threshold is exceeded in every iteration. The term "magic" indicates that these numbers are not derived from data but instead must be carefully chosen by an experienced user.

Only if the number of DataBots that want to jump exceeds a certain threshold value, called a fixed point in [Herrmann, 2011], will another iteration of the current epoch start. Otherwise, a new epoch with a smaller radius begins. This threshold value is required in SOP because the following case was not sufficiently considered: Often, as a result of a jump of one DataBot, not only will the scent of that DataBot change, but so will those of all the other DataBots in its new neighborhood and, more importantly, its old neighborhood. Because all DataBots are allowed to jump simultaneously, the DataBots are unable to update their scents sufficiently quickly in response to the changes occurring around them before they jump themselves; the scent at a possible new position is compared with an outdated (incorrect) scent at the current position, because the scent at the current position will have changed as a result of the jumps of other DataBots. This may result in random jumping.

In addition, if the scents at their current positions become worse, other DataBots will become unhappy. Therefore, on the one hand, they should also be allowed to jump, but on the other hand, allowing these DataBots to jump could trigger a cyclic process in which the DataBots simply follow each other. There is also a possibility that when DataBots are unhappy with their current positions, they may be unable to find new ones. Either no open positions may exist, or the scents at all other positions in the small circle around the DataBot itself may be even worse. This occurs because in a Gaussian distribution, there is a very high probability of making only small jumps and an exponentially lower probability of making larger jumps.

To summarize, these problems are intrinsic to the SOP algorithm and are unrelated to the sparse probabilistic movements of the agents, as claimed by [Herrmann, 2011, p. 66].

Another problem with the annealing process in SOP is the assumption that the stress $S(\lambda, e)$ will be decreased only through iterations (Fig. 4.3 in [Herrmann, 2011, p. 69]) in which the DataBots move.

If the neighborhood function $F_R$ is chosen to be a Gaussian distribution, then a smaller radius implies a reduction of the neighborhood function, i.e., $R_1 < R_2 \Rightarrow F_{R_1} < F_{R_2}$, because the standard deviation is defined by the radius. As shown by the curve in Fig. 4.3 in [Herrmann, 2011, p. 69], the sum of the scent<sup>64</sup> in a neighborhood (in Herrmann's thesis, this is called the sum of (topographic) stress) therefore also decreases because for lower values of the neighborhood function $F_R$, the scent<sup>64</sup> values and, consequently, the stress S must be lower:

$$F_{R_1} < F_{R_2} \Rightarrow \lambda(R_1) < \lambda(R_2) \Rightarrow S(R_1) < S(R_2)$$

Only if the iterations are within the same epoch (with a constant radius *R*) must a reduction in stress be driven by DataBot movement. Therefore, applying *argmin* between scent<sup>64</sup> values associated with different neighborhood radii results in random jumping of the DataBots.

<sup>64</sup> Defined in chapter 7.1.

Furthermore, the annealing scheme appears to reduce the stress S until convergence is reached (see Fig. 4.3 in [Herrmann, 2011, p. 69]). However, defining the scent<sup>64</sup> and $R_{min} = 1$ for the SOP algorithm as proposed by Herrmann results in $\lambda = \infty$ if there are no other DataBots in the neighborhood of a jumping DataBot. Even worse, this could lead to random jumping if, for example, two simultaneously jumping DataBots can smell only themselves when changing positions or if a reduction in the scent is only an effect of a reduction in the number of DataBots in the neighborhood.

By contrast, in Eq. 8.6 the payoff $\lambda(b_j, R)$ considered in Pswarm was modified based on symmetry considerations, because the two-dimensional output-space distances are irrelevant if the coordinate system is polar. In this case, it is sufficient simply to use radii, and thus, it is not necessary to simulate radial neighborhoods by means of expensive computations using a Gaussian neighborhood function. Pswarm allows the definition of a sharp, radial, and deterministic neighborhood function (called Cone, Eq. 8.5) instead of the blurry, squarer than radial, and stochastic neighborhood of SOP.

In Pswarm, the "fixed point condition" of [Herrmann, 2011] is replaced with the equilibrium of happiness, $\frac{\partial S}{\partial e} = 0$ in Eq. 8.9. The use of the derivative makes it possible, during an epoch with a specific radius R, to find an iteration in which changes to the positions of some unhappy DataBots will not change the global happiness of all DataBots. In other words, an unhappy DataBot may jump to a new, more profitable position to become happier, but the DataBots surrounding its old position will simultaneously be left with less profitable positions and, in turn, become unhappier. This results in a kind of equilibrium in which, on the global scale of the toroidal plane<sup>65</sup>, the DataBots are incapable of finding more profitable positions.

When the DataBots are not allowed to jump simultaneously, they are able to detect the payoffs related to other DataBots in their current positions before deciding to jump. By allowing all DataBots to jump in every iteration, as in SOP, the process of finding emergent structures could be delayed or even destroyed.

On a toroidal grid, setting the maximal neighborhood radius to the maximal distance on the grid results in self-interaction of the DataBots: the probabilities of choosing a new position will overlap for radii that extend beyond the closer edge of the grid (*R>Lines/2* if *Lines<Columns*). Moreover, the neighborhood of one DataBot will overlap with itself, which will result in an incorrect calculation of the payoff and disrupt the process of emergence. Furthermore, the (maximal) neighborhood radius R in SOP is determined based on the architecture of the lattice-shaped output space [Herrmann, 2011, p. 138], which was set to a constant value of 64x64 in the cited thesis regardless of the specific structures of the various data sets to be analyzed.

<sup>65</sup> This statement is only true if the possible jump length does not decrease with the neighborhood size.

Using Schelling's model in SOP is difficult because the dependence on chance, the data and the parameter settings causes an enormous number of iterations to be required [Hatna/Benenson, 2012] for the separation of the DataBots. Consequently, the number of iterations must be limited, and a threshold must be set on the number of jumping DataBots. Additionally, the attempt to find the minimum scent between two possible positions results in the problems discussed above. By contrast, Pswarm exploits the Nash equilibrium concept [Nash, 1950] based on the redefinition of scent as a payoff function $\lambda$ and important changes to the neighborhood definition. This results in an annealing scheme that is based on the data.

In conclusion, SOP requires the user to choose a lattice size, two magic numbers for the annealing process and, in some cases, a minimal radius, whereas Pswarm does not. Additionally, the annealing scheme of Pswarm is fully radial with sound neighborhoods, whereas the neighborhood definition and annealing process of SOP are inconsistent with each other, which could prevent effective self-organization and, thus, emergence (examples in chapter 10.3).

#### *8.2.3 Swarm Intelligence and Self-Organization*

As described in the previous chapter, swarm behavior is characterized by five main principles [Grosan et al., 2006]: *Homogeneity*, *Locality, Collision Avoidance, Velocity Matching* and *Flock Centering*. In Pswarm, every agent is based on a DataBot, and the motion of each DataBot is influenced only by a well-defined neighborhood in which no two DataBots can be located in the same place at the same time. Hence, the first three main principles are obviously used. Velocity is defined as the rate of change in position with respect to time.

Considering fluctuations due to randomness, the average change in position is defined as $\Delta\bar{R} = \frac{1}{2}\left|0.5 - \frac{Lines}{2}\right| = \frac{Lines - 1}{4}$ because the DataBots can jump with uniform probability to positions at distances ranging from 0.5 to $\frac{Lines}{2}$ units of length and the relevant time interval is one iteration (within an epoch). Therefore, on average, the agents in Pswarm exhibit velocity matching. Flock centering, in our case, refers to centering around more than one flock, if a flock is understood to have the figurative meaning of a group of similar agents. In summary, all five principles of swarm behavior are represented in Pswarm. For the simplified definition of intelligence reduced to behavior, as presented in the last chapter, Pswarm therefore uses *swarm intelligence*.

Self-organization relies on four principles [Bonabeau et al., 1999]: *positive* and *negative feedback*, *amplification of fluctuations* and *multiple interactions.* Fluctuations appear because of the random jump lengths and the random choices of new DataBot positions. Multiple interactions among DataBots are required for stigmergy in a given neighborhood in which various DataBots are present. Positive feedback and negative feedback are reflected in the choices of a DataBot to not jump when it is "happy" and to jump when it is "unhappy". Moreover, the number of DataBots cannot be reduced because each DataBot represents one data point in the data set. Consequently, self-organization is a property of Pswarm if the data set of interest contains more than 100 high-dimensional data points. Because of the randomness of the choice of possible jump positions, the system is temporally and structurally unpredictable, and Pswarm exhibits multiple interactions among many agents. The property of *irreducibility* is demonstrated by the compact and connected structures that are found (chapters 10-12). Therefore, this system of DataBots possesses the property of emergence, as defined in chapter 7.3.

### **8.3 Clustering on a Generalized U\*-Matrix**

Chapter 4 introduces a generalized U\*-matrix visualization called topographic map that can be used for any projection method. The U\*-matrix represents high-dimensional density- and distance-based structures and is visualized as a topographic map with hypsometric tints [Thrun et al., 2016a]. Chapter 4 explains the connection between an approximation made by the simplified ESOM (sESOM) algorithm and an abstract U-matrix (AU-matrix) [Lötsch/Ultsch, 2014]. The clustering approach here uses the idea applied for the ESOM method that the abstract U\* matrix can be used for hierarchical clustering [Ultsch et al., 2016a].

Here, Pswarm, the AU-matrix concept and the proposed visualization are combined in the DBS clustering approach. In contrast to SOP and ESOM, this semi-interactive approach does not require any parameters other than the number of clusters and the cluster structure, which is either connected or compact (for details, see chapter 3). The number of clusters and the cluster structure can be estimated by counting the valleys in a topographic map and from a dendrogram. If the number of clusters and the clustering method are chosen correctly, then the clusters will be well separated by mountains in the visualization. Outliers are represented as volcanoes and can be interactively marked in the visualization after the automated clustering process.

The distances required for hierarchical clustering are defined by the AU-matrix, which was introduced in [Lötsch/Ultsch, 2014] for the U-matrix of a SOM. Here, the AU-matrix itself is defined by the Pswarm projection. In principle, the approach described in this section can be used for clustering based on any projection method because it is possible to generate a generalized U-matrix for any projection method (see chapter 5).

Let $G(l, j, \mathcal{D})$ be the minimum of all possible path distances $p_{j,l}$ between a pair of points $\{j, l\} \in O$ in the output space, as defined in chapter 2; then, the graph $\mathcal{D}$ is defined as the Delaunay graph weighted by the high-dimensional Euclidean distances between the points $\{j, l\} \in I$ in the input space. In every direct neighborhood $H(k = 1, \mathcal{D}, O)$, all direct connections from the points *l* to the point *j* in the output space are weighted using the input-space distances D(l, j). In comparison to the ESOM clustering method proposed in [Ultsch et al., 2016a], here the shortest paths $G(l, j, \mathcal{D})$ are calculated additionally using the algorithm of [Dijkstra, 1959]. Contrary to [Ultsch et al., 2016a], the DBS clustering is not based on density information coded in the P-matrix, because Pswarm itself is already able to project density-based structures (e.g. projection of EngyTime in chapter 10.3, Figure 10.7).
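The shortest-path distances on the weighted graph can be sketched with a standard-library version of Dijkstra's algorithm. This is an illustration under stated assumptions: the Delaunay edges of the projected points are taken as given in the hypothetical dictionary `edges`, which maps a pair of point indices (j, l) to the input-space distance D(j, l); constructing the Delaunay graph itself is outside the scope of the sketch.

```python
import heapq


def shortest_path_distances(edges, source):
    """Shortest-path distances from `source` to every reachable point.

    edges: dict mapping (j, l) pairs of Delaunay neighbors to D(j, l)
    """
    graph = {}
    for (j, l), weight in edges.items():
        graph.setdefault(j, []).append((l, weight))  # undirected graph
        graph.setdefault(l, []).append((j, weight))
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > best.get(node, float("inf")):
            continue  # outdated queue entry
        for neighbor, weight in graph.get(node, []):
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return best
```

Calling the function once per point yields the full matrix of path distances, which can then serve as the dissimilarities for hierarchical clustering.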

For example, in Figure 8.3, there are two well-separated clusters (green and blue), which the compact DBS clustering can detect in the dendrogram (Figure 8.4, left). In fact, the dendrogram could indicate also three or four clusters, but this is not verified by the visualization. If three or four clusters were chosen, the DBS clustering algorithm would not label points in the same cluster with the same color because they would not be well separated by mountains. The cluster heatmap shown in Figure 8.4 (right) verifies the clustering result of two clusters.

The outliers in a data set may be manually identified by the user. In this case, choosing the connected structure option for the clustering process would result in the automatic detection of all outliers. However, this option does not always lead to the detection of the main clusters in terms of the Gሺl, j, ࣞሻ distances. A second example of outlier detection is presented in chapter 10 using the Tetragonula data set [Franck et al., 2004].

Figure 8.3: DBS visualization as a topographic map of the Target data set of [Ultsch, 2005a]. Two main clusters are shown; the cluster labeled in green has a higher density than the cluster labeled in blue. The outliers (orange, yellow, magenta and cyan) lie in volcanoes.

Figure 8.4: The dendrogram (left) of the Target data set generated using the Ward algorithm shows either two or four clusters; however, in Figure 8.3, only two clusters are visible. The heatmap of the Target data set (right) shows two separated clusters with some outliers, because the intracluster distances are distinctively smaller than the intercluster distances.


# **9 Experimental Methodology**

This chapter describes all the data sets used in the results chapter and the parameter settings for the various methods. In the final section, brief overviews of the Gene Ontology (GO) database and overrepresentation analysis (ORA) are provided. For general distribution analyses, the CRAN R package AdaptGauss [Thrun/Ultsch, 2015; Ultsch et al., 2015] was used. For the topographic map and island visualization, the CRAN R package GeneralizedUmatrix was used [Thrun/Ultsch, 2017b]. For the ABC analysis, the CRAN R package ABCanalysis was used [Thrun et al., 2015]. For DBS clustering and Pswarm projection, the CRAN R package DatabionicSwarm was used [Thrun, 2017].

### **9.1 Data Sets**

For the comparison of Pswarm as a projection method with the swarm-organized projection (SOP) algorithm, the original data sets of [Herrmann, 2011] were used. The artificial data sets of the Fundamental Clustering Problems Suite (FCPS) [Ultsch, 2005a] are summarized in Tab. 1 with regard to the cluster structures discussed in chapter 2.

 *"The FCPS comprises a collection of intentionally simple data sets with known classifications offering a variety of problems on which the performance of clustering algorithms can be tested. The data sets in the FCPS are specially designed such that the performance of clustering algorithms on particular challenges, for example, outliers or density- vs. distance-defined clusters, can be tested" [Ultsch/Lötsch, 2016, p. 4].* 

All FCPS data sets have unambiguously defined class labels. The error rate is defined as 1 - Accuracy, where the accuracy (Eq. 3.1 on p. 29) is computed as the sum over all data points labeled as true positives by the clustering algorithm. Because the labels are assigned arbitrarily by the algorithms, the permutation of cluster labels yielding the best accuracy was chosen in every trial.

Additional data sets that are used in later chapters are also described below in alphabetical order. If these data sets are not discussed directly in chapters 10 and 11, please see Supplements C and D, where the clusterings and the visualizations of DBS are shown. The hydrology data set and the pain genes data set are separately introduced in chapter 12.
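The evaluation described above can be sketched as follows. The code is an illustration and not the evaluation script used for the experiments: the function name `error_rate` is hypothetical, accuracy is taken as the fraction of correctly labeled points, and all assignments of cluster labels to class labels are enumerated, which is only feasible for a small number of clusters.

```python
from itertools import permutations


def error_rate(true_labels, cluster_labels):
    """1 - accuracy under the best assignment of cluster labels to classes."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    # pad with None so that surplus clusters can stay unmatched
    targets = classes + [None] * max(0, len(clusters) - len(classes))
    best_hits = 0
    for assignment in permutations(targets, len(clusters)):
        mapping = dict(zip(clusters, assignment))
        hits = sum(mapping[c] == t for c, t in zip(cluster_labels, true_labels))
        best_hits = max(best_hits, hits)
    return 1.0 - best_hits / len(true_labels)
```

For example, a clustering that merely swaps the names of two otherwise correct clusters receives an error rate of zero, because only the grouping of the points matters.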

### *9.1.1 Atom*

*"The Atom data set [Ultsch, 2005c] consists of two clusters in $\mathbb{R}^3$. The first cluster is completely enclosed by the second one and, therefore, cannot be separated by linear decision boundaries. Additionally, both clusters have different densities and variances. The Atom data set consists of a dense core of 400 points surrounded by a well separated, but sparse hull of 400 points. Both clusters are not linearly separable and many algorithms cannot construct a cohesive projection. The core is located in the center of the hull, which, for some methods based on averaging, makes it hard to solve it. The density of the core is much higher than the density in the hull. For data in the hull, some of the inner-cluster distances are bigger than the distance to the other clusters. The data set was not preprocessed" [Herrmann, 2011, pp. 99-100].*

# *9.1.2 Chainlink*

The Chainlink data set [Ultsch, 1995; Ultsch et al., 1994] consists of two clusters in $\mathbb{R}^3$. Together, the two clusters form intricate links of a chain, and therefore, they cannot be separated by linear decision boundaries [Herrmann, 2011, pp. 99-100]. The rings are cohesive in $\mathbb{R}^3$; however, many projections are not. This data set serves as an excellent demonstration of several challenges facing projection methods: The data lie on two well-separated manifolds such that the global proximities contradict the local ones in the sense that the center of each ring is closer to some elements of the other cluster than to elements of its own cluster [Herrmann, 2011, pp. 99-100]. The two rings are intertwined in $\mathbb{R}^3$ and have the same average distances and densities. The data set was not preprocessed [Herrmann, 2011, pp. 99-100]. Every cluster contains 500 points.

# *9.1.3 EngyTime*

The EngyTime data set [Baggenstoss, 2002] contains 4,096 points belonging to two clusters in $\mathbb{R}^2$; the data set is typical for sonar applications with the variables "Engy" and "Time" as a two-dimensional mixture of Gaussians. The clusters overlap, and cluster borders can be defined only by using density information. There is no empty space between the clusters. The data set was not preprocessed [Herrmann, 2011, pp. 99-100].

# *9.1.4 Golf Ball*

The Golf Ball data set "consists of an artificial data set with 4,002 points, resembling a 3-D view of a golf ball" [Ultsch/Lötsch, 2016, p. 3]. "The points are located on the surface of a sphere at equal distances from each of the six nearest neighbors" [Ultsch/Lötsch, 2016, p. 4]. This data set does not contain any natural clusters. The data set was not preprocessed.

# *9.1.5 Hepta*

The Hepta data set [Ultsch, 2003a] is used to illustrate the general problems with quality measures (QMs) and projections from the perspective of structure preservation. The three-dimensional Hepta data set consists of seven clusters that are clearly separated by distance, one of which has a much higher density. The data set consists of 212 points, comprising seven clusters of thirty points each plus two additional points in the center cluster. The centroids of the clusters span the coordinate axes of $\mathbb{R}^3$. The density of the central cluster is almost twice as high as the density of the other six clusters. The structure of the data set is clearly defined by distances and is compact. The data set was not preprocessed.

# *9.1.6 Iris*

*"Anderson's [Anderson, 1935] Iris data set was made famous by Fisher [Fisher, 1936], who used it to exemplify his linear discriminant analysis. It has since served to demonstrate the performance of many clustering algorithms" [G. Ritter, 2014, p. 220].* 

The Iris data set consists of data points in $\mathbb{R}^4$ with a prior classification and describes the geographic variation of *Iris* flowers. The data set consists of 50 samples from each of three species of *Iris* flowers, namely, Iris setosa, Iris virginica and Iris versicolor. Four features were measured for each sample: the lengths and widths of the sepals and petals (see [Herrmann, 2011, pp. 99-100]). The observations have "only two digits of precision preventing general position of the data" [G. Ritter, 2014, p. 220] and "observations 102 and 142 are even equal" [G. Ritter, 2014, p. 220]. The *I.* setosa cluster is well separated, whereas the *I.* virginica and *I.* versicolor clusters slightly overlap (see [Herrmann, 2011, pp. 99-100]). This presents "a challenge for any sensitive classifier" [G. Ritter, 2014, p. 220]. The data set was not preprocessed (see [Herrmann, 2011, pp. 99-100]).

## *9.1.7 Leukemia*

The anonymized leukemia data set consists of 12,692 gene expressions<sup>66</sup> from 554 subjects and is available from a previous publication [Haferlach et al., 2010]. Each gene expression is a logarithmic luminance intensity (presence call), which was measured using Affymetrix technology. The presence calls are related to the number of specific RNAs in a cell, which signals how active a specific gene is. Of the subjects, 109 were **healthy**, 15 were diagnosed with acute promyelocytic leukemia (**APL**), 266 had chronic lymphocytic leukemia (**CLL**), and 164 had acute myeloid leukemia (**AML**). "The study design adhered to the tenets of the Declaration of Helsinki and was approved by the ethics committees of the participating institutions before its initiation" [Haferlach et al., 2010, p. 2530]. The leukemia data set was preprocessed, resulting in a high-dimensional data set with 7,747 variables and 554 data points separated into natural clusters, as determined by the illness status and defined by discontinuities (see chapter 2). Additionally, patient consent was obtained for the data set, in accordance with the Declaration of Helsinki, and the Marburg local ethics board approved the study (No. 138/16) [Brendel, 2016].

# *9.1.8 Lsun3D*

The Lsun3D data set consists of three well-separated clusters and four outliers in $\mathbb{R}^3$; it is based on the two-dimensional Lsun data set of Moutarde and Ultsch [Moutarde/Ultsch, 2005]. Two of the clusters contain 100 points each, and the third contains 200 points. "The inter-cluster minimum distances, however, are in the same range as or even smaller than the inner-cluster mean distances" [Moutarde/Ultsch, 2005, p. 28]. The data set consists of 404 data points and was not preprocessed.

### *9.1.9 S-shape*

"The plain s-curve data set is an artificial set sampled from an S-shaped two-dimensional surface embedded in three-dimensional space" [Venna et al., 2010, p. 462]. The authors claim that "an almost perfect two-dimensional representation should be possible for a non-linear dimensionality reduction method, so this data set works as a sanity check" [Venna et al., 2010, p. 462]. Here, it is more important that the data set does not possess any natural clusters. The data set consists of 2,000 data points in $\mathbb{R}^3$ and was not preprocessed.

### *9.1.10 Swiss Banknotes*

*"The idea is to produce bills at a cost substantially lower than the imprinted number. This calls for a compromise and forgeries are not perfect" [G. Ritter, 2014, pp. 223-224]. "If a bank note is suspect but refined, then it is sent to a money-printing company, where it is carefully examined with regard to printing process, type of paper, water mark, colors, composition of inks, and more. Flury and Riedwyl [Flury/Riedwyl, 1988] had the idea to replace the features obtained from the sophisticated equipment needed for the analysis with simple linear dimensions" [G. Ritter, 2014, p. 224].* 

The Swiss Banknotes data set consists of six variables measured on 100 genuine and 100 counterfeit old Swiss 1000-franc bank notes. The variables are the length of the bank note, the height of the bank note (measured on the left side), the height of the bank note (measured on the right side), the distance from the inner frame to the lower border, the distance from the inner frame to the upper border and the length on the diagonal. The robust normalization of Milligan and Cooper [Milligan/Cooper, 1988] is applied to prevent a few features from dominating the obtained distances [Herrmann, 2011, pp. 99-100].

<sup>66</sup> Process with which information from a gene is used in the synthesis of functional RNA or protein.

# *9.1.11 Target*

The Target data set [Ultsch, 2005c] consists of two main clusters and four groups of four outliers each. The first main cluster is a sphere of 363 points, and the second cluster is a ring around the sphere and consists of 395 points. The data set as a whole consists of 770 points in $\mathbb{R}^2$. The main challenge of this data set is the four groups of outliers in the four corners. The data set was not preprocessed.

# *9.1.12 Tetra*

The Tetra data set, which is part of the FCPS, consists of 400 data points in four clusters in $\mathbb{R}^3$ that have large intracluster distances [Ultsch, 2005c]. The clusters are nearly touching each other, resulting in low intercluster distances.

# *9.1.13 Tetragonula*

The Tetragonula data set was published in [Franck et al., 2004] and is available to the public in the R package prabclus:

*"It contains the genetic data of 236 Tetragonula (Apidae) bees from Australia and Southeast Asia. The data give pairs of alleles (codominant markers) for 13 microsatellite loci. The 13 string variables consist of six digits each" [Hennig, 2014]. The format is derived from the data format used by the GENEPOP 4.0 software implemented by Rousset in 2010. "Alleles have a three digit code, so a value of "258260" on variable V10 means that on locus 10, the two alleles have codes 258 and 260. "000" refers to missing values" [Hennig, 2014].* 
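The allele coding described in the quotation can be made concrete with a short sketch. The function name `parse_genotype` and the choice to represent missing alleles as `None` are assumptions for illustration only; they are not part of the prabclus package.

```python
def parse_genotype(code):
    """Split a six-digit genotype string into two three-digit allele codes."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError(f"expected six digits, got {code!r}")
    alleles = (code[:3], code[3:])
    # "000" marks a missing value and is mapped to None here.
    return tuple(None if allele == "000" else int(allele) for allele in alleles)


print(parse_genotype("258260"))  # the example value quoted for variable V10
```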

# *9.1.14 Cuboid*

The uniform Cuboid data set "was constructed by filling a cuboid with uniformly distributed random numbers in the x, y and z directions" [Ultsch/Lötsch, 2016, p. 5]. It was introduced in this publication. "A group structure [is] clearly absent by construction" [Ultsch/Lötsch, 2016, p. 5]; thus, the data set does not possess any natural clusters. The data set consists of 1,000 data points in $\mathbb{R}^3$ and was not preprocessed. Additionally, another data set was generated by filling the same cuboid with Gaussian-distributed random numbers in the x, y and z directions.

# *9.1.15 Two Diamonds*

"The data consists of two clusters of two-dimensional points. Inside each "diamond" the values for each data point were drawn independently from uniform distributions" [Ultsch, 2003c, p. 8]. The clusters contain 300 points each. "[In] [e]ach cluster[, the] points are uniformly distributed within a square, and at one point the two squares almost touch. This data set is critical for clustering algorithms using only distances" [Moutarde/Ultsch, 2005, p. 28]. The data set was not preprocessed.

# *9.1.16 Wine*

The Wine data set [Aeberhard et al., 1992] is a 13-dimensional, real-valued data set. It consists of chemical measurements of wines grown in the same region in Italy but derived from three different cultivars. The robust normalization of Milligan and Cooper [Milligan/Cooper, 1988] is applied to prevent a few features from dominating the obtained distances [Herrmann, 2011, pp. 99-100].

# *9.1.17 Wing Nut*

*"The Wing Nut dataset […] consists [of] two symmetric data subsets of 500 points each. Each of these subsets is an overlay of equal[ly] spaced points with a lattice distance of 0.2 and random points with a growing density in one corner. The data sets are mirrored and shifted such that the gap between the subsets is larger than 0.3. Although there is a bigger distance in between the subsets than within the data of a subset, clustering algorithms like Kmeans parameterized with the right number of clusters (k=2) produce classification errors" [Moutarde/Ultsch, 2005, pp. 27-28].* 

The data set was not preprocessed.

### *9.1.18 World Gross Domestic Product (World GDP)*

The World GDP data set of [Leister, 2016] was constructed by selecting the purchasing power parity (PPP)-converted gross domestic product (GDP) per capita for the years from 1970 to 2010 from the data published in [Heston et al., 2012] of 190 countries. The data were logarithmized, and countries with missing values were not considered. In the resulting data set, 160 countries remain.



### **9.2 Parameter Settings**

The parameter settings for the clustering algorithms, the projection methods and the QMs used in this thesis are as follows.

### *9.2.1 Quality Measures (QMs)*

Freely available implementations of the trustworthiness and discontinuity (T&D) measures and the precision and recall (P&R) measures (see chapter 6.1) in C++ code were used [Nybo/Venna, 2015]. For all other measures, self-developed implementations were used. Every QM is available in our R package projections, which also includes R wrappers for the C++ code for the T&D and P&R measures. Our density-based version of the Shepard diagram is also available in the R package projections. This package can be downloaded from CRAN.

# *9.2.2 Projection Methods*

For the projection methods considered here (see chapter 4), we used freely available code, which is summarized in the CRAN package ProjectionBasedClustering [Thrun et al., 2017]. For principal component analysis (PCA) [Pearson, 1901], we used the PCA software available in the R package stats [R Development Core Team, 2008]. Due to technical limitations, ICA was omitted from the analysis. For curvilinear component analysis (CCA) [Demartines/Hérault, 1995], the CCA source code [Alhoniemi et al., 2005] was ported from MATLAB to R, and for t-distributed stochastic neighbor embedding (t-SNE) [Van der Maaten/Hinton, 2008], we used Donaldson's t-SNE implementation. Also included in the evaluation of various projection methods were the Neighbor Retrieval Visualizer (NeRV) algorithm [Venna et al., 2010] as implemented in the freely available C++ code [Nybo/Venna, 2015] called from R [Thrun et al., 2017b], the Sammon mapping technique for multidimensional scaling (MDS) [Sammon, 1969] available from [R Development Core Team, 2008], and the emergent self-organizing map (ESOM) algorithm as implemented in the R package Umatrix [Thrun et al., 2016a], which reproduced the results of [Ultsch/Mörchen, 2005].

For every projection method, only the default parameters were used, as given here (see also [Thrun et al., 2017]): The ESOM algorithm was set with 20 epochs; a planar lattice; 50 lines; 80 columns; a Euclidean neighborhood function; and a linear annealing scheme with a starting radius of 25, an end radius of 1, a starting learning rate of 0.5 and an end learning rate of 0.1.

For the NeRV method, lambda was set to 0.5 (for DCE baseline with PCA initialization) and 0.1 (default); the optimization scheme was set with 20 neighbors, 10 iterations, 2 conjugate gradient steps per iteration, and 20 conjugate gradient steps in the final iteration; and the points were randomly initialized (default). PCA and Sammon mapping did not require any input parameters. For CCA, 20 epochs, an initial step size of 0.5, and a radius of influence of $3 \cdot \max(\mathrm{std}(\mathit{data}))$ were specified. The t-SNE method was set with a perplexity of 30, 100 epochs and a maximum number of iterations of 1,000. Aside from ESOM, every projection method is available through standardized wrappers in our R package projections on CRAN. The NeRV source code was modified only as required for compatibility with the CRAN package Rcpp. The Delaunay classification error (DCE) measure is also available in our R package projections on CRAN.

### *9.2.2.1 Swarm-Organized Projection (SOP)*

The SOP parameterization was chosen following Herrmann [Herrmann, 2011, p. 98], using a 64 x 64 toroidal lattice with Gaussian neighborhoods, as described above. Further parameter specifications included a maximum of 500 iterations per epoch (for a single radius) and a jumping DataBot threshold of 5%. In a given iteration, the DataBots were allowed to jump only if the number of DataBots that wished to jump was above this threshold. If only 5% or fewer of the DataBots could find a better position or if the maximum number of iterations was exceeded, the radius was reduced. The starting radius was set to the maximum possible distance in the output space as defined by [Herrmann, 2011, p. 65]. The source code was implemented in R by Kohlhof [Kohlhof, 2010] under the supervision of Lutz Herrmann, and the SOP algorithm was executed using version 3.2.3 of R on a 64-bit Windows 7 system. Only Euclidean distances were used for SOP, consistent with the settings defined by [Herrmann, 2011, p. 98] and the restrictions of the source code. For this reason, the GDP194 data set was excluded because this data set requires the use of special dissimilarities [Herrmann, 2011, p. 100]. Moreover, it should be mentioned that Rmin was set to a value much larger than 1 for this data set, although the precise number was not recorded [Herrmann, 2011, p. 167].
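The rule for reducing the radius can be sketched as follows. This is only an illustration of the two conditions stated above; the function and parameter names are hypothetical and are not taken from the R code of Kohlhof, and treating the iteration limit as reached once 500 iterations have been completed is an assumption.

```python
def should_reduce_radius(n_willing, n_databots, iterations_done,
                         jump_threshold=0.05, max_iterations=500):
    """Decide whether the current epoch (one fixed radius) should end.

    n_willing: number of DataBots that found a better position and wish to jump.
    iterations_done: iterations already completed with the current radius.
    """
    share_willing = n_willing / n_databots
    # Too few DataBots can still improve, or the iteration budget is used up.
    return share_willing <= jump_threshold or iterations_done >= max_iterations


print(should_reduce_radius(n_willing=40, n_databots=100, iterations_done=12))
print(should_reduce_radius(n_willing=5, n_databots=100, iterations_done=12))
```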

Other functional code for SOP or its extension for very large data sets, swarm-organized quantization, was not available to the author<sup>67</sup>. A self-developed implementation based on the algorithm exactly as described in chapter 7 yielded worse results on the data sets compared with that of Kohlhof [Kohlhof, 2010] because of the problems discussed in chapter 8.

#### *9.2.2.2 Pswarm*

For Pswarm, there are no parameters to set. In the case of the Wine data set, the distances were changed to squared Euclidean distances because the resulting distance distribution yielded a better distinction between the intra- and intercluster distances (see supplement B). The data sets were compared using the generalized U-matrix technique for three-dimensional visualization, as described in chapter 5. The CRAN R package DatabionicSwarm was used [Thrun, 2017]. Notably, the three-dimensional topographic map with hypsometric tints that is referred to as the generalized U-matrix in this thesis is completely different from the gray-scale two-dimensional visualization of Herrmann [Herrmann, 2011, p. 72], which was also called the generalized U-matrix. All source code was executed in R 3.3.1 [R Development Core Team, 2008] on a 64-bit Windows 7 system.

#### *9.2.3 Common Clustering Algorithms*

For the k-means algorithm, the CRAN R package cclust was used [Dimitriadou/Hornik, 2017]. For the single linkage (SL) and Ward algorithms, the CRAN R package stats was used [R Development Core Team, 2008]. For the Ward algorithm, the option "ward.D2" was used, which is an implementation of the algorithm as described in [Ward Jr, 1963]. For the spectral clustering algorithm, the CRAN R package kernlab was used [Karatzoglou et al., 2016] with the default parameter settings: "The default character string "automatic" uses a heuristic to determine a suitable value for the width parameter of the RBF kernel", which is a "radial basis kernel function of the "Gaussian" type". The "Nyström method of calculating eigenvectors" was not used (=FALSE). The "proportion of data to use when estimating sigma" was set to the default value of 0.75, and the maximum number of iterations was restricted to 200 because of the algorithm's long computation time (on the order of days) for 100 trials using the FCPS data sets. For the mixture of Gaussians (MoG) algorithm, the CRAN R package mclust was used [Fraley et al., 2017]. In this instance, the default settings for the function "Mclust()" were used, which are not specified in the documentation. For the partitioning around medoids (PAM) algorithm, the CRAN R package cluster was used [Maechler et al., 2017].

#### **9.3 Gene Ontology (GO)**

An ontology is a representation of knowledge in which the relationships *part of* and *is a* are visualized in a directed acyclic graph (DAG). For the analysis of pain genes, the GO database was accessed via R 3.3.1 [R Development Core Team, 2008]. In the GO database, knowledge about molecular functions, biological processes and the cellular components of genes is defined using a controlled vocabulary consisting of labels called GO terms, which are used to represent biological concepts [Ashburner et al., 2000]. These terms describe and unify the attributes of genes and gene products<sup>68</sup> in a species-independent manner. "The GO terms are ordered in a directed acyclic graph (DAG), in which the set of genes annotated<sup>69</sup> to a certain term (node) is a subset of those annotated to its parent nodes" [Goeman/Mansmann, 2008]. Here, the important relationships between the nodes are of the "part of" type, resulting in a "top-down poly-hierarchy of GO terms" starting "at the root with terms with the broadest definition" and specializing "toward the leaves representing GO terms of the narrowest definition (details)" [Ultsch et al., 2016b]. Given a set of genes, ORA reveals the significance of a GO term that represents these genes or a subset of these genes [Backes et al., 2007].

<sup>67</sup> Lutz Herrmann's 2011 Java implementation is largely identical to that of [Kohlhof, 2010], but the source code could not be compiled.

#### *9.3.1 Overrepresentation Analysis (ORA)*

*"In ORA, the most commonly used statistical test is based on the hypergeometric distribution or its binomial approximation ([…] among others). Let A denote a GO term or the set of genes annotated to A (with cardinality $I_A$), and let S denote the set of genes (with cardinality $I_S$) based on a certain criterion (i.e. differential expression) from a full gene list G (with cardinality I) in an experiment. The number of genes belonging to both S and A (S∩A), denoted by $n_A$, indicates the representation of A in S. Under the null hypothesis that S and A are independent (i.e. the GO term is irrelevant to the gene cluster), $n_A$ follows a hypergeometric distribution. The [p-value] measuring the significance of association is the tail probability of observing $n_A$ or more genes annotated by A in S,*

$$p = \sum_{k=n_A}^{\min(I_A, I_S)} \frac{\binom{I_A}{k} \binom{I - I_A}{I_S - k}}{\binom{I}{I_S}} \tag{9.1}$$

*where $\binom{m}{n} = \frac{m!}{n!(m-n)!}$ is the binomial coefficient. Many software packages and webtools (Onto-Express, CLASSIFI, GoMiner, EASEonline, GeneMerge, FuncAssociate, GOTree Machine, etc.) have been developed based on the hypergeometric [p-value]. A detailed review can be found in Khatri and Drăghici [Khatri/Drăghici, 2005].*

*The hypergeometric [p-value] provides a straightforward measure of overrepresentation for each individual GO term. However, the major drawback of this approach is that it ignores the hierarchical structure in the GO DAG, which contains a substantial amount of information regarding the interactions among the GO terms" [Zhang et al., 2010, pp. 905-906].* 
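The tail probability of Eq. 9.1 can be evaluated exactly for small examples. The following is a minimal sketch using only the Python standard library; the function name `ora_p_value` and the example numbers are assumptions for illustration and do not reproduce the implementation of the R package ORA.

```python
from fractions import Fraction
from math import comb


def ora_p_value(n_a, i_a, i_s, i):
    """Tail probability of Eq. 9.1.

    n_a: observed number of genes in both S and A
    i_a: number of genes annotated to the GO term A
    i_s: number of genes in the selected set S
    i:   number of genes in the full gene list G
    """
    total = comb(i, i_s)
    tail = sum(comb(i_a, k) * comb(i - i_a, i_s - k)
               for k in range(n_a, min(i_a, i_s) + 1))
    # Exact rational arithmetic avoids rounding in this small illustration.
    return Fraction(tail, total)


# Hypothetical toy numbers: 20 genes in total, 5 annotated to A,
# 4 selected, and 2 of the selected genes annotated to A.
p = ora_p_value(n_a=2, i_a=5, i_s=4, i=20)
print(p, float(p))
```

With these toy numbers, the terms for k = 2, 3 and 4 sum to 1205 out of 4845 possible selections, i.e., p = 241/969.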

For the ORA algorithm, the R package ORA was used [Lippmann et al., 2016].

#### *9.3.2 Filtering via ABC Analysis*

The resulting p-values were filtered via ABC analysis (see chapter 5.3.2 on p. 49 for further explanation) [Ultsch/Lötsch, 2015]; thereafter, only the most important group A was considered for interpretation. For the ABC analysis algorithm, the CRAN R package ABCanalysis was used [Thrun et al., 2015].

Here, it is argued that changing the threshold with respect to the significance of the p-value does not lead to better results. Aside from the problems discussed by Button and Nuzzo [Button et al., 2013; Nuzzo, 2014], the paramount goal of a gene analysis is to find GO terms with a high effect strength. For this purpose, it is sufficient for the effect to be significant with regard to a commonly used (arbitrary) p-value threshold.

<sup>68</sup> Usually either ribonucleic acid (RNA) or a protein.

<sup>69</sup> For further details, see [Camon et al., 2003] and [Camon et al., 2004].

Let *E* be the strength of an effect as defined with respect to its p-value significance *p* (expressed as a percent), as follows:

$$E = -10\log(p) \tag{9.2}$$

At first glance, the definition given in Eq. 9.2 is contradictory to Eq. 9.1 above.

On the one hand, the calculation of p-values based on the Fisher test requires four parameters, $(I_A, I_S, k, I)$; on the other hand, one would calculate the strength of an effect based on the relative difference between the expected value *e* and the observed value *o*, known as the fold change $FC$:

$$FC(o,e) = 2\frac{o-e}{o+e} \tag{9.3}$$

Here, the p-values are calculated analogously to Backes [Backes et al., 2007], where the formula is called the hypergeometric test. However, the hypergeometric test is simply the Fisher test based on the hypergeometric distribution [Ultsch, 2014a]. The hypergeometric distribution is defined as

$$f(I_A, I_S, k, I) = \frac{\binom{I_A}{k} \binom{I - I_A}{I_S - k}}{\binom{I}{I_S}} \tag{9.4}$$

Given this distribution, the expected value $e(f)$ is defined as

$$e(f) = \sum_{k=0}^{I_S} k \frac{\binom{I_A}{k} \binom{I - I_A}{I_S - k}}{\binom{I}{I_S}} = I_S \frac{I_A}{I} \tag{9.5}$$

It can be shown that Eq. 9.2 is directly proportional to the definition of the expected number of genes in Eq. 9.5 [Ultsch, 2014a]. Therefore, the observed number of genes *o* is compared against a hypergeometric distribution (Eq. 9.4) around the expected number of genes *e* in Eq. 9.5, and in the special case of ORA, the p-values imply more than merely significance.
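The closed form on the right-hand side of Eq. 9.5 can be checked numerically against the sum on the left-hand side. The sketch below uses exact rational arithmetic; the function names and the toy numbers are assumptions for illustration.

```python
from fractions import Fraction
from math import comb


def hypergeometric_pmf(i_a, i_s, k, i):
    """Probability of Eq. 9.4 for exactly k annotated genes in the selection."""
    return Fraction(comb(i_a, k) * comb(i - i_a, i_s - k), comb(i, i_s))


def expected_value(i_a, i_s, i):
    """Left-hand side of Eq. 9.5: the sum of k times the probability of k."""
    return sum(k * hypergeometric_pmf(i_a, i_s, k, i) for k in range(i_s + 1))


# Hypothetical toy numbers: I = 20, I_A = 5, I_S = 4.
print(expected_value(5, 4, 20), Fraction(4 * 5, 20))
```

For these numbers, both sides equal exactly one expected gene.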

One may ask why the calculation must be complicated if the fold change, as defined in Eq. 9.3, could be used. The disadvantage of the fold change is illustrated in the following equation:

$$FC(o,e) = 2\frac{o-e}{o+e} = 2\frac{c \cdot o - c \cdot e}{c \cdot o + c \cdot e} \tag{9.6}$$

According to this equation, one expected gene compared with four observed genes yields the same value as 100 expected genes compared with 400 observed genes. Clearly, the effect strength here is not the same.
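The contrast between the fold change and the effect strength can be illustrated numerically. In the sketch below, the gene counts are hypothetical and chosen so that the expected values are 1 and 100 according to Eq. 9.5. Two further assumptions are made: the logarithm in Eq. 9.2 is taken to base 10, and p is used as a probability rather than a percentage.

```python
from math import comb, log10


def fold_change(o, e):
    """Fold change of Eq. 9.3."""
    return 2 * (o - e) / (o + e)


def effect_strength(o, i_a, i_s, i):
    """E = -10 * log10(p), with p the tail probability of Eq. 9.1."""
    tail = sum(comb(i_a, k) * comb(i - i_a, i_s - k)
               for k in range(o, min(i_a, i_s) + 1))
    # Logarithms of the exact integers are used so that a very small
    # p-value cannot underflow to zero before the logarithm is taken.
    return -10 * (log10(tail) - log10(comb(i, i_s)))


# Hypothetical settings with I = 10,000 genes in total:
# small case: I_A = I_S = 100,  so e = 1,   with o = 4
# large case: I_A = I_S = 1000, so e = 100, with o = 400
print(fold_change(4, 1), fold_change(400, 100))
print(effect_strength(4, 100, 100, 10_000))
print(effect_strength(400, 1000, 1000, 10_000))
```

Under these assumptions, the fold change is identical in both cases, whereas the effect strength of the second case is far larger than that of the first.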

It could be argued that this problem could be solved by reducing the p-value threshold to a low level, such that only a few GO terms are represented in the DAG. However, one would be obliged to do this manually for every ORA calculation. Moreover, to the author's knowledge, every tool or package that uses GO terms or performs ORA calculations has a different version of the GO database. Hence, the p-value calculation has a measurement error that is difficult to specify. Furthermore, even if a tool used the database obtained directly from the Gene Ontology Consortium, there is an even stronger source of measurement error: every list of genes $I_S$ to be analyzed was obtained based on microarray experiments with arbitrary thresholds or probe intensities (for a detailed discussion, see [Khatri et al., 2012, p. 3]).

Here, with regard to the definition of the effect strength given in (Eq. 9.2), it is assumed that the magnitudes of the p-values do not change regardless of measurement errors. This is the reason for taking the logarithm of the p-value in (Eq. 9.2). Moreover, Figure 9.1 shows the correlation between the fold change FC (Eq. 9.3) and the effect strength E (Eq. 9.2) for a given interval of the number of annotated genes per GO term. Consistent with Ultsch [Ultsch, 2014a], it is argued here that in ORA, the p-values are directly proportional to the effect sizes.

After setting the p-value threshold to *0.05*, which is a generally accepted level of significance, and calculating the corresponding GO terms, the results of an ABC analysis of the effect strengths as given by Eq. 9.2 can be obtained. The relevant GO terms are defined as those assigned to group A in the ABC analysis.

Figure 9.1: Scatter plot of the fold changes *FC* (Eq. 9.3) and the corresponding effect strengths *E* (Eq. 9.2) for numbers of annotated genes per GO term in the range [10, 25]. The relationship is proportional.


# **10 Results on Pre-classified Data Sets**

This chapter has three sections. In the first section, the results of the Databionic swarm (DBS) clustering framework are compared with the given prior classifications for data sets from the Fundamental Clustering Problems Suite (FCPS) [Ultsch, 2005a]. The results for nine data sets analyzed using common clustering algorithms are compared in the first subsection. In the second subsection, the results for data sets with no natural clusters are compared (e.g., Golf Ball). Neighbor Retrieval Visualizer (NeRV) projection and Ward clustering indicate the presence of clusters, whereas DBS does not.

The second section compares Pswarm with other common projection methods using the Delaunay classification error (DCE). The third section compares emergent self-organizing map (ESOM), swarm-organized projection (SOP) and Pswarm using topographic map visualizations based on the generalized U-matrix for the Wine, Iris, and Swiss Banknotes data sets as well as several FCPS data sets.

# **10.1 Comparison with Given Classifications**

The FCPS [Ultsch, 2005a] is a repository consisting of ten data sets with known classifications. These data sets are intentionally simple enough to be visualized (in 2D or 3D) but nevertheless present a variety of problems that offer good tests of the performance of clustering algorithms [Ultsch/Lötsch, 2016]. Figure 10.1 shows the performance of several common clustering algorithms compared with DBS based on 100 trials. The performance is depicted using boxplots of the error rate, which is defined as one minus the accuracy and for which 50% is the level attributable to chance (see chapter 3, Eq. 3.1). Here, the common clustering algorithms considered are single linkage (SL) [Florek et al., 1951], spectral clustering [Ng et al., 2002], the Ward algorithm [Ward Jr, 1963], the Linde-Buzo-Gray algorithm (LBG-k-means) [Linde et al., 1980], partitioning around medoids (PAM) [L. Kaufman/Rousseeuw, 1990] and the mixture of Gaussians (MoG) method with expectation maximization (EM) [Fraley/Raftery, 2002] (also known as model-based clustering).

Aside from the number of clusters, which is given for each of the artificial FCPS data sets, only the default parameter settings of the clustering algorithms were used. ESOM/U-matrix clustering [Ultsch et al., 2016a] and DBscan [Ester et al., 1996] were omitted because no default clustering settings exist for these methods. k-means has the highest overall error rate, and spectral clustering shows the highest variance. The results for the other clustering algorithms vary depending on the data set. DBS has the lowest overall error rate. However, on the Tetra data set, it is outperformed by PAM and MoG; on the EngyTime data set, it is outperformed by MoG; and in the case of the Wing Nut data set, it is outperformed by spectral clustering. Additional statistical tests for Figure 10.1 can be found in supplement I. With the help of insights from chapter 3, Tab. 3101 lists the FCPS cluster structures alongside the algorithms with the best results in terms of the lowest error rate and variance for each data set.

Figure 10.1: Error rate (see p. 107) of 100 trials of common clustering algorithms on nine FCPS data sets, shown as boxplots with the notch as median; chance level at 50%. The interactive clustering approach of DBS was not used here. Abbreviations: single linkage (SL), Linde-Buzo-Gray algorithm (LBG-kmeans), partitioning around medoids (PAM), mixture-of-Gaussians clustering (MoG), Databionic swarm (DBS). Additional statistical tests can be found in supplement I.

#### *10.1.1 Recognition of the Absence of Clusters*

The Golf Ball data set (see chapter 9) does not exhibit natural clusters. Therefore, it is analyzed separately because, with the exception of SL and the Ward algorithm, the common clustering algorithms give no indication regarding the existence of clusters. This "cluster tendency problem has not received a great deal of attention but is certainly an important problem" [Jain/Dubes, 1988, p. 222]. Reproducing the results of [Ultsch/Lötsch, 2016], the Ward algorithm indicates six clusters, whereas SL indicates two clusters (Figure 10.2). As seen from the two dendrograms generated using DBS, the connected approach does not indicate any clusters, whereas the compact approach indicates four clusters (Figure 10.3). However, the presence of four clusters is not confirmed by the topographic map of DBS.

In Figure 10.4, the topographic maps of the DBS and NeRV projections are compared. The NeRV projection of the Golf Ball data set with λ = 0.5 (for the other parameters, see the R package projections), i.e., with precision and recall weighted equally, is shown in Figure 10.4 (top). The visualization of the NeRV projection strongly indicates a two-cluster structure, whereas the DBS projection does not (Figure 10.4, bottom). The compact DBS clustering divides the data points lying in valleys into different clusters and merges the data points into clusters through hills, resulting in cluster borders that are not defined by mountains.

The topographic maps of DBS of the S-shape data set and the uniform and Gaussian Cuboid data sets (see chapter 9) are also shown in supplement D, Figure D.19. None of these data sets contains any natural clusters; this is correctly visualized using the DBS approach.

Figure 10.2: The dendrogram generated using the Ward algorithm indicates at least two clusters with a high intercluster distance. The SL dendrogram could indicate two clusters with a very low intercluster distance.

Figure 10.3: The two dendrograms generated using DBS. The connected DBS clustering does not indicate any structure, whereas the compact DBS clustering indicates two or four clusters. However, Figure 10.4 shows that these clusters are inconsistent with the visualization.

#### **10.2 Evaluation of Projections Using the Delaunay Classification Error (DCE)**

Figure 10.5 shows the results for the DCE measure, relative to the baseline, for 100 trials of the common projection methods ESOM, NeRV, Sammon mapping (a multidimensional scaling (MDS) technique), curvilinear component analysis (CCA), principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE). Positive values indicate higher errors compared with the baseline, whereas negative values indicate lower errors. The baseline is the NeRV projection with λ = 0.5 and PCA initialization; this baseline was chosen because the outcome of this initialization is deterministic (for the other parameters, see the R package projections). The parameter setting λ = 0.5 indicates that precision and recall are weighted equally. Every subfigure shows a robust mean estimate *M* and a robust standard deviation estimate *S* for the 100 relative DCEs. Notably, it is claimed that t-SNE projections are similar to NeRV projections with λ = 1 [Venna et al., 2010].

The linear method PCA and the MDS technique of Sammon mapping are unable to separate the connected structures of the Chainlink and Atom data sets based on their assumed neighborhood relations. This result confirms the assumptions made in chapter 4. By contrast, the CCA projections have difficulty separating compact structures based on intra- versus intercluster distances. However, not all focusing projection methods are able to separate connected structures, e.g., the t-SNE projections of Chainlink.

Without the U-matrix, the ESOM projection method distributes the points uniformly, which results in a higher DCE. The projections generated by t-SNE, Pswarm and NeRV with their default settings show high variances, although the variance in the accuracy of the DBS clustering results for these data sets is low (Figure 10.1).

Figure 10.4: **Top**: Topographic map of the NeRV projection (λ = 0.5) of the Golf Ball data set indicates two well-separated clusters. **Bottom**: The topographic map of the DBS projection and (compact) clustering of the Golf Ball data set. The projection does not indicate a cluster structure. The DBS clustering generates clusters that are not separated by mountains. No island can be extracted from the toroidal visualization.

Statistical testing was performed using the two-sample, one-sided Wilcoxon rank sum test with continuity correction [Hollander/Wolfe, 1973, pp. 68–75]. The DCE values for the Pswarm projections were compared with the projections obtained using the other methods with the "nearest"<sup>70</sup> ranges of DCE values "above" and "below" those of Pswarm (visually in the 90° rotated figures). In the former case, the DCE values of Pswarm are more negative (shifted to the left) compared with the DCE values of the projection method with the nearest range of values. Consequently, a significant result means that Pswarm's performance is considerably better. In the latter case, the DCE values of Pswarm are more positive (shifted to the right), and a significant result means that Pswarm's performance is worse than that of the projection method with the nearest range of DCE values "below" those of Pswarm. Statistical results regarding the performance of Pswarm in Figure 10.5 are as follows.


#### **10.3 Topographic Maps with Hypsometric Colors**

To compare Pswarm as a projection method with SOP and ESOM, the data sets of [Herrmann, 2011, pp. 99-100] were used. After several trials had been computed, topographic maps with hypsometric colors (hypsometric tints) were generated based only on the visually best<sup>71</sup> scatter plot. The Atom, Chainlink, EngyTime, Iris, Swiss Banknotes, and Wine data sets were projected using SOP, ESOM and Pswarm and visualized using the U-matrix or generalized U-matrix approach.

Figure 10.6 shows that only the colored labels corresponding to the prior classification separate the two clusters of EngyTime. The topographic map is inconsistent with the projected points in terms of lattice locations. Moreover, the separation is blurry, and several points are misplaced. Notably, the cardinality of the data set is 4096, and there are only 4096 positions on a 64x64 lattice. However, the visualization presented in Figure 10.6 shows many empty positions. Consequently, there are many positions at which more than one DataBot is located; therefore, the colored labels could be misleading, and the quality measures of [Herrmann, 2011] could be incorrect.

<sup>70</sup> With the highest overlap in *M* ± *S*. It is assumed that non-overlapping ranges of DCE values are always statistically significant.

<sup>71</sup> In the sense that the structures defined by the prior classification were preserved.

Figure 10.5: Relative DCE values for projections of the Atom, Hepta, Lsun3D, Chainlink and Tetra data sets. The following seven methods are compared: Pswarm, ESOM, CCA, PCA, Sammon mapping, NeRV and t-SNE. The most structure-preserving projections have the lowest negative values. No projection method is able to outperform any other projection method on all five data sets.

Table 10.1: Cluster structures in the artificial benchmark sets of the FCPS [Ultsch, 2005a], as defined in chapter 2. The clustering algorithms with the lowest error rate and variance in Figure 10.1 are listed for each data set. These results confirm the assumptions discussed in chapter 3 regarding the cluster structures sought by common clustering algorithms. On the right, the projection methods that were unable to find the structure are listed for the three-dimensional data sets. The ESOM method is omitted because it distributes the projected points uniformly. Additional statistical tests can be found in supplement I.


By contrast, in the topographic map of the Pswarm projection shown in Figure 10.7, the clusters are clearly separated by both the positions of the projected points and the high-dimensional distances and densities of the generalized U\*-matrix. Here, only one DataBot is allowed per grid position. In comparison to Figure 10.7, the planar ESOM/U\*-matrix projection presented in Figure 10.8 does not clearly show the border between the two clusters. As shown in Figure 10.9, when the default settings (toroidal) are used, it is difficult to distinguish between the two clusters. Because the extraction of an island was not possible, a tiled display is shown in Figure 10.9. Likewise, for the Wing Nut data set, the topographic map of the Pswarm projection shows a clear cluster structure, whereas the toroidal ESOM/U-matrix projection does not (Figure 10.10 and supplement E, Figure E.23) when the P-matrix and U\*-matrix visualization is not used.

On the Iris data set, the topographic map of the generalized U\*-matrix of the SOP result shows three clusters that are clearly separated by hills, but these clusters do not match the colored labels of the prior classification (supplement C, Figure C.13). By contrast, the Pswarm projection visualized using the generalized U\*-matrix approach does show these clusters, one of which is defined by its density (supplement C, Figure C.14). Five points are misplaced. The ESOM/U-matrix method is unable to separate two of the three clusters (supplement E, Figure E.22).

Figure 10.6: Topographic map of the EngyTime data set projected using SOP with the default parameters: The two clusters are mixed and difficult to separate without the colored labels corresponding to the classification. The radius of the P-matrix was automatically chosen to be 1.38. No island could be extracted.

Figure 10.7: Topographic map of the EngyTime data set projected using DBS (196x220) with an automatically chosen lattice size: Two clusters are clearly visible, and the DBS clustering has an accuracy of 95%.

Figure 10.8: U\*-matrix visualization of the toroidal ESOM projection of the EngyTime data set: The data set contains 4096 observations, and the lattice contains 4096 neurons. As shown, not every neuron is a best matching unit (BMU); therefore some BMUs include more than one observation, and the colored labels are misleading. The clusters are mixed, and no border between the green and blue BMUs can be found.

Figure 10.9: U\*-matrix visualization of the planar ESOM projection of the EngyTime data set: The data set contains 4096 observations, and the lattice contains 4096 neurons. As shown, not every neuron is a best matching unit (BMU); therefore, some BMUs include more than one observation, and the colored labels are misleading. The clusters are mixed, and a border between the green and blue BMUs is difficult to locate.

Figure 10.10: Topographic map of the DBS projection of the Wing Nut data set with the generalized U-matrix (64x68). Both clusters are clearly separated, but four points are misplaced.

The topographic map of the Swiss Banknotes data set as projected using SOP shows three clusters based on high-dimensional distances in the generalized U-matrix, with one misplaced point (supplement C, Figure C.9). Without the topographic map, a scatter plot of the projected points would not lead the reader to the conclusion that the data set consists of separate clusters because the projected points defined by the DataBots are uniformly distributed. By comparison, Pswarm reveals two unambiguously separated clusters with two misplaced points (supplement C, Figure C.10). In the ESOM/U-matrix projection, one best matching unit is misplaced. The cluster of blue best matching units could be interpreted as two clusters, one small and one large, based on the high hills in between (supplement E, Figure E.21).

An interpretation of the uniformly distributed projected points of the Wine data set, as generated via SOP, does not allow the number of clusters to be determined (supplement C, Figure C.11). The generalized U-matrix shows no clear borders between projected points with differently colored labels. Several points are misplaced. By contrast, the topographic map of the Pswarm projection explicitly shows three clusters (one triangular, one rectangular and one square), but six points are misplaced (supplement C, Figure C.12). In the ESOM/U-matrix projection, the clusters in the Wine data set are difficult to separate without their colored labels (supplement E, Figure E.20).

Again, in the SOP result for the Atom data set, the clusters are distinguished only by the borders of the generalized U-matrix and the colored labels corresponding to the prior classification because the points are uniformly distributed (supplement C, Figure C.15). However, the visualization could also be misleading in suggesting that the data set consists of three clusters. The topographic map of the Pswarm projection explicitly shows two clusters (supplement C, Figure C.16).

The projections of the Chainlink data set obtained using both SOP and Pswarm are similar (supplement C, Figure C.17), but the Pswarm visualization is smoother in terms of intracluster structure (supplement C, Figure C.18).

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **11 DBS on Natural Data Sets**

Several real-world data sets are used in this chapter to show that Databionic swarm (DBS) is able to find clusters in a variety of cases. The leukemia data set is based on luminance measurements of 7747 different active or non-active genes in 554 human subjects. The World GDP data set is a multivariate time series that consists of monetary values for 190 countries from 1970 to 2010. The Tetragonula data set contains 13 string variables consisting of pairs of alleles for 13 microsatellite loci in bees. In each case, suitable preprocessing and a correctly chosen distance definition make it possible for DBS to cluster and visualize the data such that existing knowledge is reproduced.

# **11.1 Types of Leukemia**

The leukemia data set consists of 7747 variables for 554 subjects (for details, see chapter 3). Of the subjects, 109 were healthy, 15 were diagnosed with acute promyelocytic leukemia (APL), 266 had chronic lymphocytic leukemia (CLL), and 164 had acute myeloid leukemia (AML). The leukemia data set is a high-dimensional data set with natural clusters specified by the illness status and defined by discontinuities (for details, see chapters 3 and 9).

Figure 11.1 shows a visualization of the healthy patients and the patients diagnosed with these three major types of leukemia. The four groups are well separated by mountains, with the subjects represented by points of different colors. Magenta points indicate healthy subjects, whereas points of other colors indicate ill subjects. The automatic clustering of DBS is able to separate the four groups with an accuracy of 99.6%. Two outliers can be seen in Figure 11.1, marked with red arrows. These green and yellow outliers cannot be explained without deanonymization of the patients, which was not feasible for the author. They may be misclassified, but a future publication will address this diagnostic problem<sup>72</sup>.

# **11.2 World Gross Domestic Product (World GDP)**

The World GDP data set, published in [Leister, 2016], consists of data on the gross domestic product (GDP) per capita for 160 countries over the past 40 years (see chapter 9 for details). The dynamic time warping (DTW) distances were calculated using the R package dtw [Giorgino, 2009], which computes the optimal alignment between two time series. The result of the DBS method in Figure 11.2 shows a clear cluster structure, which is confirmed by the heatmap in Figure 11.3. The homogeneity of the cluster structures of DBS is visualized in the silhouette plot in Figure 11.4.
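To make the notion of an optimal alignment concrete, the following minimal Python sketch computes a DTW distance between two univariate series by dynamic programming. It is not the dtw package: the absolute difference as local cost and the unit-weight step pattern are assumptions made for illustration, and they may differ from the settings used to obtain the results reported here.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two numeric sequences.

    Assumptions: local cost |a_i - b_j|, steps (1,0), (0,1), (1,1) with
    unit weights, no windowing and no normalization by path length.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both series must be non-empty")
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Two series that differ only by a locally stretched segment, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, receive a distance of zero, which is the property that makes DTW suitable for comparing time courses that are shifted or stretched in time.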

As the rules deduced through Classification and Regression Tree (CART) analysis show in Figure 11.5, the clusters are defined by a tragic event that occurred in 2001, the crashing of airplanes into the World Trade Center. In its aftermath, "the world economy was experiencing its first synchronized global recession in a quarter-century" [Makinen, 2002, p. 17].

<sup>72</sup> It should be remarked that a data-driven DBS clustering does not reproduce the classification(s) of AML (like FAB subtypes) or CLL of research in this area, e.g. [Bene et al., 1995; Bennett et al., 1985; Vardiman et al., 2009; Haferlach et al., 2010], for CLL see [Rosenwald et al., 2001]. See also p. 30 fn. 19.

Therefore, the first cluster consists mostly of African and Asian countries, which were generally unaffected by this event, and the second cluster consists of American and European countries, which were affected. The outlier is Equatorial Guinea, where the first parliamentary elections since 1968 were held in 1983. Equatorial Guinea shows the smallest variance in its GDP, which is mostly based on oil; this small country, with an area of 28,000 square kilometers, is one of sub-Saharan Africa's largest oil producers.

Figure 11.1: Topographic map with DBS clustering results for the leukemia data set, showing six clusters and an accuracy of 99.6% in comparison with the prior classification of four leukemia statuses. **Top**: healthy (magenta), AML (cyan), APL (blue), and CLL (black). Two outliers are marked with red arrows: an APL outlier (green) and a CLL outlier (yellow). **Bottom**: 3D print (see [Thrun et al., 2016a]); colors are not available yet due to technical limitations.

Figure 11.2: Topographic map of the DBS clustering of the World GDP data set shows two distinctive clusters. There is one outlier, colored in magenta and marked with a red arrow.

Figure 11.3: Heatmap of the dynamic time warping (DTW) distances for the World GDP data set shows a small variance of intracluster distance.

Figure 11.4: Silhouette plot of the DBS clustering results for the World GDP data set indicates that data points (y-axis) above a value of 0.5 (x-axis) have been assigned to an appropriate cluster.

Figure 11.5: Classification and Regression Tree (CART) analysis rules for the clusters. The two main clusters are defined only by an event in 2001.

#### **11.3 Tetragonula Bees**

The Tetragonula data set was published in [Franck et al., 2004] and contains the genetic data of 236 Tetragonula bees from Australia and Southeast Asia, expressed using 13 variables (for details, see chapter 9), with a specific distance definition.

The shared allele distance is described in [Hausdorf/Hennig, 2010, p. 493] as follows:

*"[The distance is] defined as one minus the proportion of alleles shared by 2 individuals averaged over loci. Loci with missing values are not considered in the pairwise distance calculation. In the presence of missing values, this distance measure is not necessarily a metric."* 

For the distance calculation, the R package fpc of [Hausdorf/Hennig, 2010] was used with the distance introduced by [Bowcock et al., 1994].
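The quoted definition can be sketched in a few lines of Python. The sketch below is not the implementation of the R packages mentioned above; the representation of an individual as a list of allele pairs (with `None` for a missing locus) and the counting of shared alleles per locus as a multiset intersection divided by two are assumptions made here for illustration.

```python
from collections import Counter


def shared_allele_distance(ind1, ind2):
    """One minus the proportion of shared alleles, averaged over loci.

    Each individual is a list of loci; a locus is a pair of alleles or
    None if missing. Loci missing in either individual are skipped.
    """
    proportions = []
    for genotype1, genotype2 in zip(ind1, ind2):
        if genotype1 is None or genotype2 is None:
            continue  # loci with missing values are not considered
        shared = sum((Counter(genotype1) & Counter(genotype2)).values())
        proportions.append(shared / 2)
    if not proportions:
        raise ValueError("no locus is observed in both individuals")
    return 1.0 - sum(proportions) / len(proportions)
```

Because the set of loci entering the average can change from pair to pair when values are missing, the triangle inequality is not guaranteed, which is consistent with the remark in the quotation that the measure is not necessarily a metric.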

The first DBS visualization implied the existence of 8 clusters and two pairs of outliers. Hence, 100 trials of Pswarm projection and DBS clustering with k=10 clusters were generated, and the best one (i.e., the one with the smallest Delaunay classification error (DCE)) was chosen (Figure 11.7). The silhouette plot indicates a hyperspherical cluster structure (Figure 11.6), and the heatmap of the distances in Figure 11.9 confirmed the DBS clustering. This application of DBS illustrated the possibility of using multiple swarms by means of parallel computing, for which the term *deep swarming* (see [Ultsch, 2016b]) is introduced in this work in analogy to deep learning [Goodfellow et al., 2016]. Additionally, using the prabclus package, the largest within-cluster gap, the cluster separation, and the average within-cluster dissimilarity of [Hennig, 2014] were calculated to be 0.5, 0.33 and 0.29, respectively. These values are the minima reported in [Hennig, 2014], presented there in Fig. 4. Seven clusters of the average linkage hierarchical clustering with ten clusters ([Hennig, 2014, p. 5]) could be reproduced (see supplement H) with a total accuracy of 93%. Finally, as Figure 11.8 shows, the clusters strongly depend on the geographic origins of the bees:

*"Longitude (x-axis) and latitude (y-axis) of locations of individuals in decimal format, i.e. one number is latitude (negative values are South), with minutes and seconds converted to fractions. The other number is longitude (negative values are West)" (see [Hennig, 2014] and the prabclus package).* 

After the transformation into a two-dimensional plane, Figure 11.8 shows that the first eight clusters (96% of data) are consistent with the geography (top) except for the outliers in Queensland (bottom). The dependency on geography was also illustrated in [Franck et al., 2004, p. 2319].

Figure 11.6: Silhouette plot of the Tetragonula data set, showing very homogeneous cluster structures because most of the data points (y-axis) are above a value of 0.5 (x-axis).

Figure 11.7: Topographic map of the DBS clustering of the Tetragonula data set with the best DCE shows eight clusters and three groups of outliers. The cluster labels are colored as shown on the right, and a similar color code is used in Figure 11.8 below. Clusters are ordered sequentially by the number of samples such that cluster 1 contains the bee species with the highest occurrence.

Figure 11.8: Clustering is consistent with the geographic origins: The first eight clusters (96% of data) are consistent with the geography (top) except for the outliers in Queensland (bottom). Pictures were generated using the ggmap CRAN package.

Figure 11.9: Heatmap of the distances for the Tetragonula data set shows large intercluster distances.


# **12 Knowledge Discovery with DBS**

In contrast to chapter 11, in which Databionic swarm (DBS) clustering was applied to recognize more or less obvious knowledge, this chapter shows that DBS is also able to discover new knowledge. A hydrological data set of multivariate time series [Aubert et al., 2016] and a data set consisting of pain genes [Ultsch et al., 2016b] are used for this purpose. In [Aubert et al., 2016], a high-frequency time series analysis was performed, but no prediction could be made. Here, the focus is placed on daily frequency.

The analysis of [Ultsch et al., 2016b] concentrated on chronic pain, and for that reason, it required searching for candidate genes that modulate pain chronification. This chapter, however, focuses on defining the distances between genes and grouping genes by semantic similarity, which can be explained based on overrepresentation analysis (ORA) [Backes et al., 2007].

# **12.1 Hydrology**

*"Human activities modify the global nitrogen cycle, particularly through farming. These practices have unintended consequences; for example, nitrate lost from terrestrial runoff to streams and estuaries can impact aquatic life" [Aubert et al., 2016].* 

A greater understanding of water quality variations can improve the evaluation of the state of water bodies and lead to better recommendations for appropriate and efficient management practices [Cirmo/McDonnell, 1997]. Accordingly, the objective here is to predict water quality in the Schwingbach catchment<sup>73</sup> using the currently available variables related to chemical water quality, namely nitrate and (electrical) conductivity (*N&C*); this task belongs to the science of hydrology. Electrical conductivity is a measure that reflects the water quality as a whole because it indicates the variations in the presence of ions other than nitrate in the water body [Aubert, 2015]. Nitrate in water bodies is partially responsible for the phenomenon of eutrophication [Diaz, 2001]. Eutrophication occurs when an excess of nutrients (i.e., nitrate) leads to uncontrollable growth of aquatic plant life, followed by a depletion of the dissolved oxygen [Diaz, 2001; Howarth et al., 1996]. For this reason, the nitrate concentration is one of the parameters used to evaluate water quality.

*"The available dataset contained in total 32,196 data points for each of the 14 variables (in total, 4% missing data). For technical reasons, no nitrate data were available during winter, so the actual time span of nitrate monitoring was 05 March 2013 12:45 to 24 September 2013 12:30 and 27 April 2014 00:00 to 23 October 13:15. Data were analyzed as a whole, without differentiating between the hydrological years"* [Aubert et al., 2016].

Conductivity, in particular, will be explained using another set of variables, which are indicators of hydrological and biological conditions. In contrast to the temporal high-frequency analysis (with 15-minute intervals) of [Aubert et al., 2016], here, the daily courses for each variable were calculated as the sums of all daily measurements, resulting in a low-frequency analysis. The missing values were imputed using the seven-nearest-neighbors approach. All variables were linearly decorrelated, and the logarithms of the variables q13 and q18 were calculated. Subsequently, all variables, with the exception of rain, were normalized to values between zero and one through robust normalization. The outliers in the rain variable were detected via ABC analysis [Ultsch/Lötsch, 2015]: rain was normalized with respect to the minimum value in group A, and then all points in group A were set to a value of 1.1 for rain. After feature selection, the data set comprised 12 variables over 343 days.

M. C. Thrun, *Projection-Based Clustering through Self-Organization and Swarm Intelligence*, https://doi.org/10.1007/978-3-658-20540-9\_12

<sup>73</sup> A catchment is a dynamic system, and current observations depend on previous hydrological states [Aubert et al., 2016].

The preprocessed daily courses are shown in Figure 12.1. The preprocessing resulted in Euclidean distances with a multimodal distribution (Figure 12.2). The first mode represents the intracluster distances, and the second mode represents the intercluster distances (see also chapter 3, Figure 3.1).

DBS was used for visualization and clustering. The outliers were marked interactively, resulting in five classes (Figure 12.4). The clusters have small intracluster distances and high intercluster distances, as visualized using DBS (Figure 12.4) and confirmed by the heatmap (Figure 12.5). The silhouette plot shows that all clusters can be well modeled as hyperspheres (Figure 12.3).
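The silhouette values underlying such a plot can be sketched as follows. The code uses the standard definition s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean distance of point i to the other members of its cluster and b(i) is the smallest mean distance to the members of any other cluster. It is a plain-Python illustration on a precomputed distance matrix and not the routine used to produce the figures; treating singleton clusters as having a silhouette of zero is a convention assumed here.

```python
def silhouette_values(dist, labels):
    """Silhouette value of every point, given a full distance matrix."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    values = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            values.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of another cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append(0.0 if denom == 0 else (b - a) / denom)
    return values
```

Values close to one indicate points that lie much closer to their own cluster than to the nearest other cluster, which is why a threshold such as 0.5 is used when reading the plots.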

Figure 12.1: Variances of the variables of the hydrology data set after preprocessing and feature extraction, visualized using boxplots.

Figure 12.2: Distribution analysis of the distances. The first mode represents the intracluster distances, and the second mode represents the intercluster distances (for further explanation see chapter 3, Figure 3.1).

Figure 12.3: Silhouette plot of the DBS clustering of the hydrology data set indicates that data points (y-axis) above a value of 0.5 (x-axis) have been assigned to an appropriate cluster.

Figure 12.4: Five clusters are shown in the topographic map of DBS of the Hydrology data set. For 3D print see supplement G, Figure G.24.

Figure 12.5: The five clusters have clearly distinctive distances, as shown by the heatmap; there are small distances within each cluster and large distances between the clusters.

Figure 12.6: Classification and Regression Tree (CART) analysis rules for the hydrology data set with the five clusters identified by DBS. Applying the rules to the clustering combined with the data set results in three misclassified points (0.9%). Abbreviations: rainfall intensity (rain), soil temperature (St24), soil moisture (Smoist24), groundwater level at point 3 (GWl3). All values are expressed as percentages.

#### *12.1.1 Knowledge Acquisition and Prediction in the Hydrology Data Set*

Here, the rules extracted from the Classification and Regression Tree (CART) decision tree, as shown in Figure 12.6, were applied to the clustering. In comparison to the DBS clustering, the application of the CART rules to the data set results in the misclassification of three data points (0.9%). Based on this finding, it can be said that the rules precisely classify the data set (Figure 12.6). The generated rules are listed in Table 12.1.

Table 12.1: The CART rules based on Figure 12.6, in which the clusters of Figure 12.4 are used. Abbreviations: rainfall intensity (rain), soil temperature (St24), soil moisture (Smoist24), groundwater level at point 3 (GWl3). All values are expressed as percentages.


The N&C measurements can be described by two variables related to biological processes, namely, soil temperature and soil moisture, and two variables related to hydrological processes, namely, rainfall intensity and groundwater level at point 3, which represents downslope conditions. Temperature influences the activities of living organisms, such as soil microbial organisms [Zak et al., 1999]. Soil moisture determines microbial activities, such as long-term inactivity in dried soil followed by wetting [Borken/Matzner, 2009]. The groundwater level (or head, in m) is the main factor driving discharge in a catchment [Orlowski et al., 2014]. Rainfall intensity triggers discharge and affects soil moisture as well as leaching of nutrients [Orlowski et al., 2014].

A thorough examination of the CART results based on the five distinguishing rules R (Tab. 12.1) yields the following classes C:


With regard to N&C, these classes can be distinguished as follows: the first two classes (green and blue) are responsible for normal N&C, the third class (magenta) is associated with low N&C, and the fourth and fifth classes (teal and black) are responsible for high N&C (Figure 12.7).

After a rain shower or on dry days when the ground is wet and hot, the N&C concentrations are normal. The N&C concentrations are high (above 50%) on rainy days, when the downslope groundwater level is above 72%. The N&C concentration is low (<25%) on dry days (below 50% rain) when the ground is cold (below 29% of the maximum ground temperature). These definitions enable future predictions of daily N&C concentrations.
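The verbal definitions above can be turned into a small decision function. The following Python sketch encodes only the two conditions stated explicitly in this paragraph (high and low N&C) and treats every other day as normal; it is not the set of CART rules of Table 12.1. The function name, the argument names and the reading of "rainy" as a rain value of at least 50% are assumptions made for illustration. All inputs are percentages, as in Figure 12.6.

```python
def predict_nc_level(rain_pct, groundwater_pct, soil_temp_pct):
    """Coarse N&C level from the verbal rules; a simplified sketch only."""
    # Assumption: "rainy" is the complement of "dry" (below 50% rain).
    rainy = rain_pct >= 50.0
    if rainy and groundwater_pct > 72.0:
        return "high"    # rainy day with a high downslope groundwater level
    if not rainy and soil_temp_pct < 29.0:
        return "low"     # dry day with cold ground
    return "normal"      # all remaining days in this simplified reading
```

A day with little rain and warm ground, for instance, falls through both conditions and is reported as normal.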

It is assumed here that the structures associated with the 5 clusters described by these classes are defined by discontinuities. Consequently, the clusters should contain samples of different natures and based on different processes. Given this assumption, it is valid to statistically test whether the N&C distributions significantly differ between clusters. The Kolmogorov–Smirnov test (KS test) is a nonparametric two-sample test of the null hypothesis that two variables are drawn from the same continuous distribution [Conover, 1971, pp. 309-314], and it is implemented in the R language [R Development Core Team, 2008].
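For illustration, the two-sample KS statistic is the largest vertical distance between the two empirical distribution functions. The Python sketch below computes it together with an asymptotic p-value from the Kolmogorov distribution. It is not the R implementation referred to above, which may compute exact p-values for small samples; the use of the asymptotic formula throughout is an assumption of this sketch.

```python
import math
from bisect import bisect_right


def ks_two_sample(x, y):
    """Return (D, p) for the two-sided two-sample Kolmogorov-Smirnov test.

    D is the maximum distance between the empirical distribution
    functions; p uses the asymptotic Kolmogorov series (an approximation).
    """
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    for value in xs + ys:
        f1 = bisect_right(xs, value) / n  # ECDF of x at value
        f2 = bisect_right(ys, value) / m  # ECDF of y at value
        d = max(d, abs(f1 - f2))
    lam = d * math.sqrt(n * m / (n + m))
    if lam == 0.0:
        return d, 1.0
    total = 0.0
    for k in range(1, 10001):
        term = math.exp(-2.0 * k * k * lam * lam)
        total += term if k % 2 == 1 else -term
        if term < 1e-15:
            break
    p = min(1.0, max(0.0, 2.0 * total))
    return d, p
```

Applied to the nitrate or conductivity values of two clusters, a small p-value would indicate that the two samples are unlikely to stem from the same continuous distribution.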

The statistical results are shown in supplement F, Tab. 1 and 2. All N&C distributions significantly differ between clusters, with the exception of cluster 4 compared with 5, for both variables.

Figure 12.7: Boxplots of the five classes with regard to nitrate N (top) and conductivity C (bottom). All values are expressed as percentages.

#### **12.2 Pain Genes**

In [Ultsch et al., 2016b], a set of genes with relevance to pain<sup>74</sup> was obtained from four sources, and the search of several databases and studies (e.g., the Pain Genes Database, the PubMed database) was described in detail. This search yielded a set of n = 535 genes, subsequently referred to as *pain genes* in [Ultsch et al., 2016b].

After accessing the Gene Ontology (GO) database in this work, 528 of the pain genes were found to be annotated, and the remaining seven genes were disregarded in the subsequent analysis (feature selection). Various types of annotation (evidence codes) are possible. When the inverse document frequency *idf* is used [Sparck Jones, 1972], the distances between these genes are defined as follows (as discussed in [Ultsch, 2014b]):

Let the documents be represented by GO terms T, and let the terms used to calculate *idf* be represented by the genes G, which are coded with numbers defined by the National Center for Biotechnology Information (NCBI) [NCBI, 2013]; the term frequency *tf* is then the frequency of occurrence of a gene in a given document divided by the maximal occurrence of the gene in any document:

<sup>74</sup> "An unpleasant sensory and emotional experience associated with actual or potential tissue damage, or described in terms of such damage" [Merskey/Bogduk, 1994].

$$tf(G,T) = \frac{f(G,T)}{\max(f)}\tag{1}$$

If only manually curated evidence codes are used for annotation, then $tf(G,T) = 1$.

Let N be the number of GO terms to which the pain genes are annotated, and let $n_l$ be the number of GO terms to which a pain gene with a given NCBI number is annotated; then, the inverse document frequency is defined as

$$idf_l = \log\left(1 + \frac{N}{n_l}\right) \tag{2}$$

and the term frequency–inverse document frequency is defined as

$$tfidf = tf(G,T) \cdot idf_l = 1 \cdot idf_l \tag{3}$$

A gene that is annotated to only some GO terms is more meaningful than one that is annotated to almost every GO term or to only a few GO terms. Hence, the inverse document frequency reduces the weight of genes that occur very frequently among the GO terms and increases the weight of genes that occur rarely. The distance D between two genes l and j is defined as the absolute distance in terms of *idf*:

$$D(l,j) = \text{abs}(idf_l - idf_j) \tag{4}$$
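The inverse document frequency of equation (2) and the distance of equation (4) can be sketched in a few lines. This is an illustration under stated assumptions, not the code used in this work: the gene identifiers and annotation counts are made up, the natural logarithm is assumed because the text does not name the base, and $tf(G,T) = 1$ is assumed throughout.

```python
import math


def idf(n_terms_total, n_terms_of_gene):
    """Equation (2): idf_l = log(1 + N / n_l), natural logarithm assumed."""
    return math.log(1.0 + n_terms_total / n_terms_of_gene)


def idf_distances(annotation_counts, n_terms_total):
    """Equation (4): D(l, j) = |idf_l - idf_j| for every pair of genes.

    annotation_counts maps a gene identifier to n_l, the number of GO terms
    to which that gene is annotated.
    """
    weights = {g: idf(n_terms_total, n) for g, n in annotation_counts.items()}
    return {
        (l, j): abs(weights[l] - weights[j])
        for l in weights
        for j in weights
    }


if __name__ == "__main__":
    # Hypothetical gene identifiers and annotation counts, for illustration only.
    counts = {"gene_a": 5, "gene_b": 50, "gene_c": 500}
    distances = idf_distances(counts, n_terms_total=1000)
    for pair, value in sorted(distances.items()):
        print(pair, round(value, 3))
```

Because the distance depends only on the two scalar weights, it is symmetric and zero on the diagonal, and genes with similar annotation counts end up close to each other.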

This distance was used to generate the DBS visualization shown in Figure 12.9, and clustering was automatically performed after the identification of 8 clusters in the visualization. The clusters are verified by the heatmap presented in Figure 12.10 and the Silhouette plot in Figure 12.8.

Figure 12.8: Silhouette plot of the DBS clustering of pain genes. Most of the clusters of pain genes can be modeled as hyperspheres. However, cluster 6 has a different high-dimensional structure.

Figure 12.9: Topographic map of DBS clustering of 528 pain genes. Clusters 1 and 3 and clusters 2 and 4 are very similar to each other. Cluster 6, labeled in yellow, consists of outliers. The counts per cluster, from 1 to 8, are 72, 99, 75, 133, 53, 21, 58, and 17. For 3D print see supplement G, Figure G.25.

Figure 12.10: Heatmap of the distances with regard to the 8 identified clusters of pain genes, which verifies that the clustering is sound. Clusters 1 and 3 and clusters 2 and 4 are very similar to each other. Cluster 6 is clearly defined by outliers.

# *12.2.1 Prior Knowledge*

The pain genes were analyzed by means of ORA, revealing several important functions, as listed below. If the distance definition and DBS clustering were applied correctly to the pain genes data set, it should be possible to rediscover structures that are already known from two main publications on this topic. [Lötsch et al., 2013] defined twelve functions of pain for 460 pain genes (Figure 12.11):


Additionally, in 2016, twelve chronification functions of 535 pain genes were identified [Ultsch et al., 2016b]:


With the aim of reproducing the knowledge listed above, for every cluster in Figure 12.9, ORA was performed using the R package ORA [Lippmann et al., 2016]. The resulting p-values were filtered via ABC analysis, and thereafter, only group A was considered for interpretation (see chapter 9 for further details).

# *12.2.2 Knowledge Acquisition in Clusters of Pain Genes*

DBS identified eight clusters<sup>75</sup> of genes (Figure 12.9). For each cluster, an ORA was performed. In contrast to the standard approach, in which the Bonferroni correction [Perneger, 1998] is often used, here, the p-values of the GO terms in the ORA results were filtered via ABC analysis [Ultsch/Lötsch, 2015]. The Bonferroni correction reduces the alpha error of significance, but it may cause valid results to be disregarded because the beta error simultaneously increases (for extensive discussions, see [Button et al., 2013; Nuzzo, 2014; Perneger, 1998]). Here, it is argued that in the special case of ORA, the p-values also represent the effect strength. Therefore, the adjustments to the significance threshold made by the Bonferroni correction are unnecessary. In contrast to the standard approach, ABC analysis was used to identify the most important GO terms as those assigned to group A, which had the highest effect strength. After the reduction of the directed acyclic graph (DAG) using this approach, the functional areas identified in [Lötsch et al., 2013] and [Ultsch et al., 2016b] were found to be associated with three of the classes (Table 12.2).

<sup>75</sup> After inspection of the functional areas in the eight ORA results, the eight clusters could be reduced to six (for details, see Tab. 2).
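The effect of the Bonferroni correction described above can be made concrete with a small worked example. The sketch below uses made-up p-values and a hypothetical function name; it only shows how dividing the significance level by the number of tests discards results that pass the unadjusted threshold, and it does not implement ABC analysis.

```python
def significant(p_values, alpha=0.05, bonferroni=False):
    """Flag p-values as significant, optionally with a Bonferroni threshold."""
    # Bonferroni: the significance level is divided by the number of tests.
    threshold = alpha / len(p_values) if bonferroni else alpha
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Hypothetical p-values of four GO terms, for illustration only.
    p_values = [0.001, 0.012, 0.03, 0.2]
    print(significant(p_values))                    # unadjusted, alpha = 0.05
    print(significant(p_values, bonferroni=True))   # adjusted, alpha / 4
```

With four tests, the adjusted threshold is 0.05 / 4 = 0.0125, so the term with p = 0.03 is no longer reported although it passes the unadjusted threshold. This is the loss of potentially valid results that the text attributes to the increased beta error.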

Considering the prior knowledge regarding pain functions and pain chronification, the following clusters could be combined: cluster 1 and cluster 3 were combined into class C1\*, and cluster 2 and cluster 4 were combined into class C2\*, because they showed similar functions and were separated only by low borders in the topographic map with hypsometric tints (Figure 12.9). Hence, it was possible to identify five classes with different semantic characterizations, plus one class of outliers (Tab. 2). Class C1\* predominantly describes the pain functions of cells and reproduces knowledge presented in section 12.2.1. The main class (C2\*) describes the molecular transport and signaling of pain, also reproducing prior knowledge about the pain genes. Class C5 represents the downregulation of metabolic processes and the upregulation of the creatine metabolic process, which is a new discovery enabled by the DBS clustering. Class C6 describes outliers that are not relevant to the ORA-based DAG; these outliers are surrounded by very large hills in Figure 12.9. Class C7 characterizes the response and regulation systems as well as the upregulation of the phosphorus metabolic process, effectively reproducing the results of [Lötsch et al., 2013] and [Ultsch et al., 2016b]. The final class, C8, could represent hematopoietic stem cell differentiation. In summary, these clusters reproduce the previously identified functions of pain genes as described in section 12.2.1. In addition, new insights can also be found from class C5 and perhaps class C8.

Figure 12.11: The biological process of pain with the twelve functions of pain genes [Lötsch et al., 2013].




# **13 Discussion**

This work examined and analyzed patterns in high-dimensional data characterized by discontinuity. Such distance- or density-based patterns are either compact or connected structures. If the structures are compact, inter- versus intracluster distances are relevant. If they are connected, then density relations and neighborhoods play an important role. Here, it was demonstrated that the neighborhood of a point can always be defined based on graph theory. If the neighborhoods are defined based only on distance, then the structure is compact and a Euclidean graph can be used. If the structure is connected, then two subtypes can be deduced from graph theory: direction-based and unidirectional neighborhoods.

In the context of cluster analysis, structures induced by discontinuities lead to natural clusters, as elaborated in chapter 3. The definition of discontinuity in high-dimensional data, presented in chapter 2, enables the generalization of spatial separation, which was described by [Handl et al.] as a third category of clustering criteria [Handl et al., 2005, p. 3202]. Here, in contrast to [Handl et al., 2005], it is argued that there is no distinction between connected and spatially separated structures or between compact and spatially separated structures<sup>76</sup>. Instead, the third category (spatial separation) can be generalized as the prerequisite for natural clusters defined by either compact or connected structures. It was discussed in chapter 3 that, through the application of basic principles founded on graph theory, clustering algorithms usually search for clusters with a predefined structure. However, it is not always clear which structures are sought because the objective functions that are optimized can be mathematically very difficult to understand. An extensive evaluation of the objective functions found in the literature supports this argument and implies two subtypes of structures sought by common clustering algorithms, called direction-based and unidirectional structures. The assumptions put forward in chapter 3 (Figure 3.5) were verified in chapter 10 (Table 10.1) using data sets from the Fundamental Clustering Problems Suite (FCPS). A question arises regarding how one can choose a clustering algorithm that assumes the correct type of cluster structure for a high-dimensional data set without prior knowledge. Here, it is suggested that dimensionality reduction methods for generating (two-dimensional) projections may help solve this problem.

This work has demonstrated that the objective functions used in clustering and projection methods and the quality measures (QMs) used to evaluate them are based on the fundamental distinction between connected and compact structures. The conclusion is that when the task is to achieve a structure-preserving visualization or clustering, the optimization of an objective function could yield misleading results if the underlying structures of the high-dimensional data of interest are unknown. Hence, a completely different approach is required, which, in chapter 7, motivates an extensive review of the application of artificial intelligence in data science. In chapter 7, two interesting concepts are addressed, called self-organization and swarm intelligence. Through self-organization, the irreducible structures of high-dimensional data can emerge, in a process defined as emergence in chapter 7. If properly applied using a swarm of intelligent agents, the approach presented in this work can outperform the optimization of an objective function for the tasks of clustering and dimensionality reduction.

<sup>76</sup> In [Handl et al.], the three categories of clustering criteria were called connectedness, compactness and spatial separation [Handl et al., 2005, p. 3202].

#### **The Databionic Swarm (DBS) method**

*"[A clustering approach] must be adaptive or exhibit 'plasticity,' possibly allowing for the creation of new clusters, if the data warrants it. On the other hand, if the cluster structures are unstable […], then it is difficult to ascribe much significance to any particular clustering. This general problem has been called 'the stability/plasticity dilemma' " [Duda et al., 2001, p. 559].* 

The work presented herein introduces a clustering algorithm based on a swarm-based projection method combined with a human-understandable visualization technique. In terms of stability and plasticity (chapter 10, Figure 9.1), the Databionic swarm (DBS) framework outperforms common algorithms in clustering tasks on the FCPS.

*"One source of this dilemma is that with clustering based on a global criterion, every sample can have an influence on the location of a cluster center, regardless of how remote it might be" [Duda et al., 2001, p. 559].* 

In contrast to standard approaches, swarm techniques are known for their properties of flexibility and robustness [Bonabeau/Meyer, 2001; Şahin, 2004]. As a swarm technique, DBS clustering is robust with respect to outliers (see chapter 10).

DBS is a flexible and robust clustering framework that consists of three independent modules. The first module is the parameter-free projection method Pswarm, which exploits the concepts of self-organization and emergence, game theory, swarm intelligence and symmetry considerations. The second module is a parameter-free high-dimensional data visualization technique, which generates projected points on a topographic map with hypsometric colors, called the generalized U-matrix. The third module is a clustering method with no sensitive parameters. The clustering can be verified by the visualization and vice versa. The term DBS refers to the method as a whole. DBS enables even a non-professional in the field of data mining to apply its algorithms for visualization and/or clustering to data sets with completely different structures drawn from diverse research fields, simply by downloading the corresponding R package [Thrun, 2017].

Each module of DBS was compared with various competing algorithms, and in the majority of cases, the modules outperformed those algorithms. However, the author of this work concurs with [Coretto/Hennig, 2016] that despite one's best intentions and efforts to conduct fair comparisons of various methods of visualization, projection and clustering, "ultimately it would be good to have comparisons of methods run by researchers who did not have their hand in the design of any of the methods"; this is because "(simulation) studies can always be designed that make any method 'win.' " The author also agrees with [Coretto/Hennig, 2016] that "readers need to make up their own mind about to what extent our study covered situations that are important to them."

With these considerations in mind, DBS was particularly designed to be flexible and to allow the modules to be interchangeable. An expert in the field of data mining may prefer a method with a clear optimization strategy or may not require the entire DBS framework for his/her application. The interchangeability of the modules is useful in such a case. For example, it is possible to use the visualization and clustering module with NeRV instead of Pswarm. Alternatively, a user could cluster a data set using his/her preferred clustering algorithm and then verify the clusters visually using Pswarm and the generalized U-matrix. As another example, a user could use Pswarm and its clustering algorithm with no visualization, by setting the number of clusters with the aid of the dendrogram of the swarm-defined distances. In summary, the projection-based clustering framework proposed here is a user-friendly platform for the visualization of high-dimensional structures and/or for clustering with no sensitive parameters.<sup>77</sup>

#### **Clustering with DBS**

*"[T]he majority of clustering algorithms […] impose a clustering structure on the data set X, even though X may not possess such a structure" [Theodoridis/Koutroumbas, 2009, p. 863].* 

 Additionally, they may return meaningless results in the absence of natural clusters [Cormack, 1971, pp. 345-346; Handl et al., 2005, p. 3203; Jain/Dubes, 1988, p. 75]. The results presented in this work illustrate that the DBS algorithm does not suffer from these two disadvantages. The DBS algorithm makes it possible to apply the abstract U-matrix (AU-matrix) [Lötsch/Ultsch, 2014] to a Pswarm projection instead of an emergent self-organizing map (ESOM) projection. The new clustering approach of DBS is defined by using the shortest-path distances [Dijkstra, 1959] of the AU-matrix and a hierarchical approach to clustering. In contrast to swarm-organized projection (SOP) and ESOM, this approach does not require any parameters except the number of clusters and a two-option parameter that specifies the cluster structure as being either compact or connected (see chapter 3 for details). "One of the most difficult decisions to make is the number of clusters" [Everitt et al., 2001, p. 179]. In DBS, the number of clusters and the cluster structure can be easily estimated from a careful examination of the topographic map (by counting the valleys) and with the help of a dendrogram. If the number of clusters and the cluster structure are chosen properly, then the clusters in the topographic map will be well separated by mountains.
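The shortest-path distances mentioned above can be illustrated with Dijkstra's algorithm on a small weighted graph. This is a generic sketch, not the DBS implementation: the graph, its node names and its edge weights are hypothetical stand-ins for the weighted neighborhood graph that the AU-matrix provides.

```python
import heapq


def shortest_path_distances(graph, source):
    """Dijkstra's algorithm: distances from source to every node.

    graph maps each node to a dict {neighbor: non-negative edge weight}.
    """
    dist = {node: float("inf") for node in graph}
    dist[source] = 0.0
    queue = [(0.0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if d > dist[u]:
            continue  # outdated queue entry, a shorter path was found earlier
        for v, weight in graph[u].items():
            candidate = d + weight
            if candidate < dist[v]:
                dist[v] = candidate
                heapq.heappush(queue, (candidate, v))
    return dist


if __name__ == "__main__":
    # Hypothetical graph: a, b, c lie in one valley, d lies behind a high ridge.
    graph = {
        "a": {"b": 1.0, "c": 4.0},
        "b": {"a": 1.0, "c": 1.5},
        "c": {"a": 4.0, "b": 1.5, "d": 10.0},
        "d": {"c": 10.0},
    }
    print(shortest_path_distances(graph, "a"))
```

In this toy graph, the expensive edge towards `d` dominates every path that crosses it, so `d` stays far from the other nodes. A hierarchical clustering on such distances would separate points along high-weight edges, which is the intuition behind combining shortest paths with a hierarchical approach.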

It is argued here that DBS clustering should be semi-interactive and requires user supervision to achieve the best possible results. Nevertheless, the results of automatic DBS clustering with no user intervention were also compared with the results of the common clustering algorithms k-means [MacQueen, 1967], partitioning around medoids (PAM) [L. Kaufman/Rousseeuw, 1990], single linkage (SL) [Florek et al., 1951] and spectral clustering [Ng et al., 2002] as well as two state-of-the-art clustering algorithms: the mixture of Gaussians (MoG) method [Fraley/Raftery, 2002] and the Ward algorithm [Ward Jr, 1963]. "Several of the comparative studies […] conclude that Ward's method […] outperforms other hierarchical clustering methods" [Jain/Dubes, 1988, p. 81]. MoG clustering, which is also known as model-based clustering, serves as the reference technique [Bouveyron/Brunet-Saumard, 2014]. Clustering algorithms such as DBscan [Ester et al., 1996] or the ESOM/U-matrix approach [Ultsch et al., 2016a] require additional sensitive and continuous parameters and were omitted from the comparison for that reason. Every clustering algorithm was applied using the default parameter settings and the correct number of clusters. Calculations were performed for 100 trials on the FCPS data sets [Ultsch, 2005c].
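Comparing clustering results against a prior classification requires matching arbitrary cluster labels to class labels first. The following sketch shows one common way to do this, by taking the best accuracy over all label assignments. It is an assumption that an evaluation of this kind was used; the text does not spell out how the error rates were computed, and the function name is hypothetical. The brute-force search is only practical for a small number of clusters.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_labels):
    """Best fraction of matching labels over all cluster-to-class assignments."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")

    best_hits = 0
    # Try every injective mapping of cluster labels onto class labels.
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(mapping[c] == t for t, c in zip(true_labels, cluster_labels))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


if __name__ == "__main__":
    truth = [1, 1, 1, 2, 2, 2]
    result = [2, 2, 1, 1, 1, 1]   # hypothetical clustering with swapped labels
    accuracy = clustering_accuracy(truth, result)
    print("accuracy:", accuracy, "error rate:", 1.0 - accuracy)
```

Repeating such an evaluation over many trials gives both a mean error rate and its variance, the two quantities discussed for the compared algorithms.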

The main result achieved in the work presented herein concerns the error rates of the clustering algorithms tested in these trials. As already stated throughout this work, clustering algorithms often predefine the structure of the clusters they seek; e.g., for PAM and k-means, the shape is round, and thus, the structure is compact. Therefore, these algorithms failed on the Chainlink and Atom data sets. In addition, the k-means and spectral clustering algorithms showed large variances in their results on the Hepta and Target data sets. It is known that the k-means algorithm sometimes strongly depends on the order of objects in a data set [L. R. Kaufman/Rousseeuw, 2005, p. 114], which may be the cause of the large variance in the results. This variance was shown through several examples for the spectral clustering algorithm, in which case the results were strongly trial-dependent, even when the parameter settings remained unchanged. The MoG method yielded results of comparably good quality to those of DBS, but it still failed in the case of the Lsun3D data set (in the sense that it showed a large variance) and in the case of the Target data set and its outliers. The MoG approach uses the expectation maximization (EM) algorithm, which is known to be subject to such problems on univariate data sets [Ultsch et al., 2015]. Notably, only "if the underlying distribution comes from a mixture of component densities described by a set of unknown parameters" can it be estimated using MoG approaches [Duda et al., 2001, e.g. p. 581]. This is the case for the FCPS data sets, resulting in high performance of the MoG algorithm. However, natural data sets do not necessarily meet this assumption. Additionally, the MoG method fails if the dimensionality of the data set is too high (chapter 3).

<sup>77</sup> After this work, it was also made available in [Thrun et al., 2017; Thrun/Ultsch, 2017a].

The automatic DBS clustering showed a small variance in its results and yielded good accuracy for all data sets. In contrast to all other approaches, in every trial in which the clustering accuracy of DBS was worse than that of some other algorithm, its performance could be improved by using the semi-interactive approach. The reason for this ability to improve the results of DBS lies in the main advantage of DBS clustering, namely, the possibility of verifying the clustering results through visualization, as described below. For a clustering algorithm, it is relevant to test for the absence of a cluster structure [Everitt et al., 2001, p. 180], or the clustering tendency [Theodoridis/Koutroumbas, 2009, p. 896]. Usually, tests for the clustering tendency rely on statistical tests [Theodoridis/Koutroumbas, 2009, p. 896]. Unlike other hierarchical clustering algorithms (except for ESOM/U-matrix clustering [Ultsch et al., 2016a]), the DBS algorithm finds no clusters if no natural clusters exist. The clustering tendency is visualized by the generalized U-matrix.

#### **Generalized U-matrix visualization and structure preservation**

The technique of producing visualizations in the form of a two-dimensional scatter plot of projected points currently remains the state of the art in cluster analysis (e.g., [Hennig et al., 2015, pp. 119-120, 683-684; Ritter, 2014, p. 223]). However, such a two-dimensional visualization can lead to a misleading interpretation of the underlying structures because the low-dimensional similarities do not completely represent the high-dimensional distances in two dimensions. Two types of error have been identified in the literature (see chapter 5): forward projection error (FPE) and backward projection error (BPE) [Aupetit, 2007; Ultsch/Herrmann, 2005; Venna et al., 2010]. In addition to these errors, this work introduces the concept of structure preservation, which is the preservation of high-dimensional discontinuities such that no points are allowed to intrude into the discontinuity regions of the two-dimensional projection.

The FPEs and BPEs were visualized for various projection methods using a two-dimensional gray-scale U-matrix visualization in [Ultsch/Mörchen, 2006]. Such a gray-scale U-matrix is the most commonly used method for displaying dissimilarities in SOMs [K. Tasdemir/Merenyi, 2009, p. 550; Kadim Tasdemir/Merényi, 2012, p. 3]. Here, the idea was to "apply Self-Organizing Map training without changing the best matching unit [prototype] assignment" [Ultsch/Mörchen, 2006, pp. 3-4] through the transformation of projected points into best matching units, as introduced in this work. Unlike the approach of Ultsch and Mörchen, the newly proposed simplified ESOM (sESOM) algorithm does not require a learning rate, and the cooling scheme is defined by a special neighborhood function based on symmetry considerations, which results in a parameter-free algorithm (cf. [Ultsch/Mörchen, 2006, p. 4]). This makes it possible to visualize SOMs as topographic maps with hypsometric tints [Thrun et al., 2016a], which serves as a basis for a visualization technique that can be applied in combination with any projection method. The third dimension is used to visualize the local BPE and FPE around each projected point in precisely defined height-dependent colors, thereby giving rise to the generalized U-matrix, which is a generalization of the U-map concept [Ultsch, 2003a].

Here, it is argued that the generalized U-matrix visualization of a topographic map (second DBS module) is able to visualize both compact and connected structures. In terms of the preservation of high-dimensional structures, it is a suitable approach for visualizing the BPEs, FPEs and discontinuities in a data set. However, as shown in Fig. 5.6 in chapter 5, this visualization technique has certain limitations. If additional gaps with intruding points are added by the projection method, then the generalized U-matrix is not able to distinguish identical clusters from distinct ones. To the author's knowledge, the only visualization that shows whether clusters have been disrupted uses a linear gray-scale approach based on a holistic solution called the proximity measure [Aupetit, 2007]. In the two-dimensional projected space, Voronoi cells are filled with brighter or darker luminances depending on their high-dimensional distances D to a reference point. "Points with bright cells are connected in the original space" [Aupetit, 2007, p. 17]. However, cluster disruption can only be successfully visualized when the user selects the correct reference point. To estimate the correct reference point for a projected space, additional visualizations of other measures, as introduced in this paper, must be used. Consequently, this process is both time-consuming and challenging and requires user supervision.
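The linear gray-scale idea behind the proximity measure can be sketched as follows. This is a simplified illustration and not Aupetit's implementation: the function name and the normalization by the largest distance are assumptions, and the Voronoi tessellation of the projected points is omitted, so the function only returns the luminance that each cell would receive.

```python
import math


def proximity_luminance(points_high_dim, reference_index):
    """Map high-dimensional distances to a reference point onto [0, 1].

    1.0 (bright) marks the reference point itself, 0.0 (dark) the point that
    is farthest away in the original space.
    """
    reference = points_high_dim[reference_index]
    distances = [math.dist(reference, point) for point in points_high_dim]
    largest = max(distances)
    if largest == 0.0:
        return [1.0] * len(distances)  # all points coincide with the reference
    # Linear gray scale (assumption): brightness falls off with distance.
    return [1.0 - d / largest for d in distances]


if __name__ == "__main__":
    data = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (6.0, 8.0, 0.0)]  # toy data
    print(proximity_luminance(data, reference_index=0))
```

The dependence on `reference_index` makes the limitation discussed above visible: a different reference point yields a different shading, so a disrupted cluster only shows up when a suitable reference point is chosen.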

Many quality criteria exist for evaluating the visualization of a scatter plot. Chapter 6 addressed the question of whether the currently existing QMs are able to measure structure preservation. By using a generalized, graph-theory-based definition for a neighborhood of points, it is possible to group the QMs based on their semantic characterization. Here, 19 common QMs were reviewed and grouped, and they were compared with regard to their ability to measure the structure preservation of a projection. It is argued here that the QMs that have been presented in the literature have difficulty correctly capturing the discontinuities in high-dimensional data because of their inherent assumptions regarding the underlying high-dimensional structures. This was shown using the Hepta and Chainlink data sets in supplement *A*.

Otherwise, an objective function could be defined using the "best" QM, and it would always be possible to obtain a structure-preserving two-dimensional visualization by optimizing this objective function. In this work, no answer could be found to the question of how the quality of structure preservation can be automatically measured or visualized without prior knowledge.

However, when a prior classification of the data is available, it can be used to evaluate the quality of structure preservation. The structures that should be preserved are defined by such a classification. A QM called the Delaunay classification error (DCE) was developed based on this concept; it allows projections to be ranked and normalized compared with a baseline and also enables statistical testing.

In summary, structure preservation depends on the chosen projection method; however, the task of choosing the correct projection method is challenging because the optimization of an objective function requires the predefinition of the structures to be visualized. The generalized U-matrix is able to visualize the similarities and dissimilarities among high-dimensional data points in a scatter plot of the projected points (BPEs and FPEs), but it is unable to visualize the disruption of clusters, based on which the quality of structure preservation is defined.

#### **The projection method Pswarm**

The first module of the DBS framework is called Pswarm. Pswarm is a projection method that does not rely on an objective function. Similarly to SOP, Pswarm uses stigmergy and a swarm of DataBots because swarm techniques are known for their properties of flexibility and robustness [Bonabeau/Meyer, 2001; Şahin, 2004]. However, in contrast to SOP, which uses an ESOM-like grid space, the environment of the DataBots in Pswarm has been redefined based on symmetry considerations [Feynman et al., 2007, pp. 147-153, 745], resulting in the use of polar coordinates on a toroidal hexagonal grid. The combination of symmetry considerations with game theory concepts endows the polar swarm (Pswarm) with a parameter-free annealing process and an automatically selected, data-driven grid size.
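The consequence of a toroidal grid is that distances wrap around the borders, so no position is closer to an edge than any other. The sketch below shows this for a plain rectangular grid with Cartesian coordinates; it is a simplification under stated assumptions and does not reproduce the hexagonal layout or the polar coordinates used by Pswarm. The function name is hypothetical.

```python
import math


def toroidal_distance(p, q, width, height):
    """Euclidean distance between grid positions p and q on a torus."""
    dx = abs(p[0] - q[0])
    dy = abs(p[1] - q[1])
    # On a torus, the shorter way may lead across the border of the grid.
    dx = min(dx, width - dx)
    dy = min(dy, height - dy)
    return math.hypot(dx, dy)


if __name__ == "__main__":
    # Opposite borders of a 10 x 10 grid are direct neighbors on the torus.
    print(toroidal_distance((0, 0), (9, 0), width=10, height=10))
    # Positions in the interior keep their usual Euclidean distance.
    print(toroidal_distance((1, 1), (4, 5), width=10, height=10))
```

Because every position has the same kind of neighborhood, agents on such a grid are not affected by border effects.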

The insights presented in chapter 7 demonstrate that Pswarm exhibits both self-organization and swarm intelligence. In the swarm-based techniques presented in the available literature, the swarms used for projection and/or clustering do not take advantage of both concepts (chapter 7.3, Figure 7.4). Moreover, no other reported swarm method exploits game theory or the phenomenon of emergence (as defined in chapter 7, section 3, after [Ultsch, 2007]). Here, the focus is placed on a subfield of dimensionality reduction in which projection methods are used for visualizing high-dimensional data in a two-dimensional space, as opposed to manifold learning methods, which are designed only to find manifolds, not to compress them into two-dimensional space [Venna et al., 2010, p. 2].

Of the methods of projecting high-dimensional data into two-dimensional space, two stand out: Neighborhood Retrieval Visualizer (NeRV) [Venna et al., 2010] and ESOM [Ultsch, 1999]. NeRV optimizes the objective function that quantifies the cost, defined as information retrieval, with the goal of visualizing the similarity relationships between data points. NeRV attempts to achieve a faithful representation of the data in two dimensions by minimizing the BPE and FPE. The cost is a tradeoff between the FPE and BPE<sup>78</sup>, which is defined by the parameter $\lambda$. ESOM is an unsupervised neural learning algorithm and can be used as a projection method if a large number of neurons is specified. ESOM remains a reference tool for two-dimensional visualization [Lee/Verleysen, 2007, p. 244]. Instead of an objective function, ESOM uses the powerful concept of emergence [Ultsch, 2007] in addition to the 3D visualization technique of [Thrun et al., 2016a], which is based on the U-matrix [Ultsch, 2003a]. Both NeRV and ESOM are state-of-the-art methods for the visualization of high-dimensional data.
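The role of the tradeoff parameter can be illustrated with a cost of the form used by NeRV, a weighted sum of the two directions of the Kullback-Leibler divergence between the neighborhood distribution in the data space (p) and in the projection (q). The sketch below is written from the general formulation in [Venna et al., 2010] as recalled here and should be read as an assumption; the function names and the toy distributions are hypothetical, and only a single point's neighborhood is considered.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) of two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def nerv_style_cost(p, q, lam):
    """Weighted sum of both KL directions; lam in [0, 1] sets the tradeoff."""
    return lam * kl_divergence(p, q) + (1.0 - lam) * kl_divergence(q, p)


if __name__ == "__main__":
    # Hypothetical neighbor probabilities of one point, for illustration only.
    p = [0.7, 0.2, 0.1]   # neighborhood in the high-dimensional space
    q = [0.4, 0.4, 0.2]   # neighborhood in the two-dimensional projection
    for lam in (0.0, 0.5, 1.0):
        print(lam, round(nerv_style_cost(p, q, lam), 4))
```

Since the two divergence directions penalize different kinds of neighborhood errors, moving `lam` between 0 and 1 shifts the emphasis from one error type to the other. This is why the outcome of such a method depends on the chosen parameter value.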

Pswarm was compared with the following common projection methods: principal component analysis (PCA), curvilinear component analysis (CCA), t-distributed stochastic neighbor embedding (t-SNE), ESOM, NeRV and the multidimensional scaling (MDS) technique of Sammon mapping. Five artificial three-dimensional data sets from the FCPS were used to compare these projection methods because of their clearly defined natural clusters. Typically, the QMs discussed in the literature indirectly assume that a projection method has a deterministic outcome. A problem that has, thus far, remained undiscussed is the stochastic outcomes of some common projection methods, such as t-SNE and CCA. Therefore, the DCEs were calculated for 100 trials per projection method and data set. Thus, the outcomes of the projection methods could be statistically compared. To enable an unbiased comparison, the DCE requires a prior classification that defines the structures in a data set. However, as discussed by [Färber et al., 2010], natural data sets may have more than one useful classification, depending on the context and the algorithm applied, because no universal definition of a cluster exists [Hennig, 2015b, p. 705]. Therefore, the evaluation of different projection methods by DCE only makes sense on artificial data sets with predefined natural clusters (see chapter 9). This is a major limitation of the DCE QM.

<sup>78</sup> In information retrieval terms, precision and recall.

It was shown that the two-dimensional projections generated by Pswarm are comparable to those produced by the state-of-the-art methods NeRV and ESOM. To the author's knowledge, every projection method considered here (except ESOM and SOP) optimizes an objective function, which may lead to the disadvantages discussed above. Moreover, some projection methods, such as ESOM and CCA, use a sophisticated annealing scheme that may be sensitive to one or more parameters or have one or more sensitive parameters themselves (e.g., $\lambda$ in NeRV). Examples are given in chapter 10.2, Tab. 10.1. In contrast to NeRV, Pswarm is not sensitive to any parameter or, as in the case of ESOM, to an annealing scheme and lattice size. It was shown that a projection with minimal BPE and FPE values does not necessarily achieve structure preservation. In the case of NeRV, it was shown that this algorithm is sensitive to its random initialization process (chapter 5, Fig. 5.6, and chapter 10). Venna et al. also proposed an alternative PCA-based initialization [Venna et al., 2010, p. 459], which in itself makes prior assumptions regarding the relevant structures of the high-dimensional data<sup>79</sup>, as illustrated by the baseline used to analyze the DCE results (see chapter 10.2, Figure 10.5). Unlike NeRV, Pswarm does not visualize cluster structures if such structures do not exist in the data, as in the case of the Golf Ball data set (or the various continuous data sets presented in supplement D); moreover, because Pswarm is a swarm-based technique, it is more robust to the random initialization process (e.g., the DBS visualization of the leukemia data set in chapter 11, Figure 10.1).

In the third section of chapter 10, the SOP algorithm is emphasized because it is another method based on a swarm of DataBots, as introduced in [Herrmann, 2009]. In [Herrmann, 2011], it was shown that SOP is nearly as good as or even better than the best of its carefully parameterized competitor methods, namely, CCA, t-SNE and ESOM, in terms of the 1-nearest-neighbor classification accuracy and the specially formulated dispersion measure of [Herrmann, 2011, p. 101]. It was also noted that these methods resulted in severe misrepresentations of the structures for several data sets, which was not the case for SOP (see also the scatter plots in section A2 of [Herrmann, 2011, pp. 158-161]).

Notably, the annealing process of the SOP algorithm is not truly self-adaptive; rather, it is parameterized, which can lead to severe errors in the projections. In the best case, the choice of the lattice size and, therefore, the maximal neighborhood radius as well as the choices of the two magic numbers (the jumping DataBots threshold and the maximum number of iterations) in the SOP algorithm have only a minor effect on the visualization of the high-dimensional structures (as in the cases of the Atom and Chainlink data sets). In the worst case, as for the EngyTime or Iris data set, all structures are prevented from emerging. Moreover, in the case of EngyTime, it was shown that when there is no restriction ensuring that no more than one DataBot can occupy each lattice position, the information about the high-dimensional structure is lost. In contrast to the dispersion measure and the 1-nearest-neighbor classification approach used by Herrmann, the comparison with SOP in this work is based on topographic maps of the projected points. These visualizations illustrate important improvements achieved by Pswarm, which are described in the last section of chapter 10.

<sup>79</sup> PCA maximizes the variance.

Several examples were presented to demonstrate that the process leading to emergence is disrupted in the SOP algorithm. Other swarms do not exhibit self-organization but instead rely on the optimization of an objective function, which makes emergence impossible. To the author's knowledge, the game theory approach to behavior-based systems remains undiscussed in the available literature on artificial intelligence in data science. The naturally clustered Wine, Swiss Banknotes and Iris data sets all illustrate the importance of consistent and appropriate definitions of the neighborhoods, scents, grid or lattice size and data-driven annealing scheme used for clustering and projection. If these definitions are inconsistent or inappropriate, as is the case for SOP, then the self-organization of the DataBots is disrupted. The ultimate disruption of the process leading to emergence may be minor (Swiss Banknotes) or major (Wine, Iris), depending on the data set and the specific trial. For the Wine data set, Pswarm gains an advantage because it allows a different distance to be chosen, whereas the SOP algorithm does not [Herrmann, 2011, p. 65]. Pswarm allows the user to define a non-metric distance method without any restrictions.

The correct selection of the parameters for the annealing scheme requires an experienced user. For example, it was shown that with the default settings, the ESOM algorithm sometimes projects three, instead of two, clusters for the Atom data set (chapter 5, Fig. 5.6). To further substantiate this argument, additional ESOM projections generated with the default parameters are presented in Supplement E. For example, it is necessary to change the lattice type from toroidal (default) to planar to achieve a correct projection of the Wing Nut data set. If the default parameters are not changed, the structures are very difficult to see. Disruption of the clusters can be seen in the ESOM/U-matrix visualizations of the Iris, Wine, and Swiss Banknotes data sets, in which one or more of the other eight parameters play an important role (see supplement C for these U-matrix visualizations).

Thus, it is argued here that the ESOM/U-matrix projections of the EngyTime, Wing Nut, Iris, Wine and Swiss Banknotes data sets may be misleading because the toroidal ESOM projections are computed without accounting for symmetry considerations, which results in unwanted boundary effects. For example, the maximal radius is set to the diagonal length<sup>80</sup> $\sqrt{L^2 + C^2}$ instead of $L/2$, which leads to overlapping of the neighborhoods if the neighborhood function is defined as Gaussian. Several examples illustrate that the uniform distribution used in the ESOM and SOP algorithms has no advantages; however, it may have some disadvantages. The attempt to distribute the projected points uniformly on the lattice is useful only if a visualization method is able to reveal the high-dimensional structures of the data. For this reason, the U-matrix visualization [Ultsch, 2003a] is mandatory for ESOM projections. In other cases, uniformly distributed projected points do not lead to new knowledge about the data set. By contrast, for the generalized U-matrix, there is no requirement for the projected points to be uniformly distributed. Consequently, Pswarm outperforms ESOM on density-based data sets such as EngyTime.

<sup>80</sup> L is the number of lines in the grid, and C is the number of columns.
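The radius argument made above for toroidal lattices can be checked numerically. The following is a minimal sketch, not taken from any ESOM implementation: it assumes integer lattice positions given as (line, column) and Euclidean distances with wrap-around, and the function names are illustrative. On a torus, no two positions are farther apart than half the lattice extent in each direction, so a radius equal to the planar diagonal is larger than any distance that can occur, and a neighborhood with a radius above $L/2$ wraps around and meets itself.

```python
import math


def toroidal_distance(a, b, n_lines, n_columns):
    """Euclidean distance between lattice positions a and b, each given as
    (line, column), when the lattice is wrapped into a torus."""
    d_line = abs(a[0] - b[0])
    d_col = abs(a[1] - b[1])
    d_line = min(d_line, n_lines - d_line)    # shorter way around, vertically
    d_col = min(d_col, n_columns - d_col)     # shorter way around, horizontally
    return math.hypot(d_line, d_col)


def max_toroidal_distance(n_lines, n_columns):
    """Largest distance that can occur between two positions on the torus."""
    return math.hypot(n_lines // 2, n_columns // 2)


L, C = 50, 82                        # lattice size used here as an example
diagonal = math.hypot(L, C)          # planar diagonal, sqrt(L^2 + C^2)
largest = max_toroidal_distance(L, C)

print("planar diagonal:", diagonal)
print("largest toroidal distance:", largest)
print("half the number of lines:", L / 2)
```

For even lattice sizes, the planar diagonal is exactly twice the largest toroidal distance, which makes the mismatch between the two radius definitions explicit.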

Being a swarm-based method, DBS suffers from the disadvantage of high computational costs. When the number of DataBots<sup>81</sup> is greater than 4000, the use of Pswarm is impractical because of the long calculation time. Further research is necessary on the application of game theory as the foundation for a data-driven annealing scheme. At this point, it can be proven only that a weak Nash equilibrium will be found [Nash, 1951], which may be the reason for the high variance observed in the DCE results (chapter 10, section 2). Only with DBS clustering can the variance of the results be noticeably improved. The structures of 14 of the investigated data sets were preserved using Pswarm (chapters 10 and 11).

The main drawbacks of the proposed approach are as follows. If no prior classification is available for a data set, then the use of the DCE measure is limited. Thus, it is very difficult to evaluate whether Pswarm and the generalized U-matrix produce a structure-preserving visualization or whether the clusters are disrupted in the visualization. Additionally, the variance of the results remains high: because it is a stochastic projection method, two different trials of Pswarm could yield different visualizations of the same data set. If the number of clusters is known beforehand, *deep swarming* may be able to solve this problem, as the Tetragonula data set demonstrated<sup>82</sup>. Moreover, it should be possible for the swarm to iteratively add new data points during or after the algorithm following a well-defined process. At present, the Pswarm algorithm is unable to do this. Briefly, it was demonstrated in sections 2 and 3 of chapter 10 that finding the correct grid or lattice size and annealing scheme for ESOM/SOP may be challenging. It should be emphasized that unlike SOP and, especially, ESOM (see supplements C and E), Pswarm is able to successfully project density-based data sets. The comparison between Pswarm and the other common projection methods with their default parameter settings resulted in two major findings. First, the state-of-the-art methods ESOM and NeRV do not outperform Pswarm, and second, Pswarm has one important advantage, namely, that it is parameter-free. However, if prior knowledge of the data set to be analyzed is available, then a projection method that is appropriately chosen with regard to the structures that should be preserved will always outperform Pswarm. Furthermore, other projection methods may also outperform Pswarm if their settings are carefully selected by an experienced user.
In summary, to the author's knowledge, Pswarm is the first swarm-based technique to show emergent properties while simultaneously combining swarm intelligence, self-organization and game theory.

#### **Knowledge discovery with DBS**

Up to this point, mainly artificial data sets have been used to assess the capabilities of DBS. In the case of natural data sets, only the prior classifications were considered. However, the introduction of a new clustering method is justified only if it is useful. Therefore, three complex real-world data sets were first analyzed using DBS to confirm its ability to reproduce known knowledge. Subsequently, two high-dimensional data sets were clustered using DBS to obtain new knowledge. The silhouette plots and the heatmaps, which showed small intracluster distances and large intercluster distances, indicated that the clustering results for all five data sets were valid.

<sup>81</sup> Which is equal to the number of high-dimensional data points.

<sup>82</sup> For details, see the next section or chapter 11, section 3.
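The silhouette criterion mentioned above can be illustrated with a small computation. The following minimal sketch assumes Euclidean distances and at least two clusters. It is not the code used for the silhouette plots in this work, and the toy points are invented for illustration. For each point, `a` is the mean distance to the other members of its own cluster and `b` is the smallest mean distance to the members of any other cluster; values near 1 indicate small intracluster and large intercluster distances.

```python
import math


def silhouette_widths(points, labels):
    """Silhouette width s(i) = (b - a) / max(a, b) for every point.
    Requires at least two clusters; singleton clusters get width 0."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        widths.append((b - a) / max(a, b))
    return widths


# Toy example: two compact, well-separated groups in the plane.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette_widths(toy_points, [0, 0, 0, 1, 1, 1])
bad = silhouette_widths(toy_points, [0, 0, 1, 1, 1, 1])   # third point mislabeled
print([round(w, 2) for w in good])
print([round(w, 2) for w in bad])
```

A negative width, as obtained for the deliberately mislabeled point, marks a point that lies closer to another cluster than to its own.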

The visualization and connected clustering of the high-dimensional<sup>83</sup> leukemia data set, which contains clearly defined natural clusters (see chapter 3), successfully reproduced the diagnoses of three types of leukemia: acute myeloid leukemia (AML), acute promyelocytic leukemia (APL) and chronic lymphocytic leukemia (CLL). Aside from two outliers (patients), the prior classification of healthy patients and patients diagnosed with the three leukemia subtypes was reproduced by the DBS clustering and visualization. The two outlier patients may be misdiagnosed; however, a future publication will address this diagnostic problem. Chapter 6 showed that aside from ESOM, no other common projection method was able to visualize the predefined cluster structure of this data set. Similarly, in chapter 3, it was demonstrated that common clustering algorithms failed to correctly cluster the leukemia data set, with the exception of the Ward algorithm, which was not able to find the two outliers.

When the dynamic time-warping distance definition was applied to a data set consisting of the gross domestic product (GDP) per capita in 190 countries for the years 1970–2010, two clusters and one outlier were found using DBS. Upon the application of Classification and Regression Tree (CART) analysis, it was found that the two clusters could be explained as being distinguished by the influence of the tragic event of planes crashing into the World Trade Center in 2001.
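For readers unfamiliar with dynamic time warping, the following minimal sketch shows the textbook dynamic-programming formulation for two univariate series. It assumes the absolute difference as local cost and the unconstrained symmetric step pattern. The variant and settings actually used for the GDP data set are not specified here, so this should be read as an illustration only.

```python
def dtw_distance(series_a, series_b):
    """Textbook dynamic time-warping distance between two numeric sequences,
    using the absolute difference as local cost (an assumption of this sketch)."""
    n, m = len(series_a), len(series_b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning the first i values of
    # series_a with the first j values of series_b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(series_a[i - 1] - series_b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # series_b value is repeated
                cost[i][j - 1],      # series_a value is repeated
                cost[i - 1][j - 1],  # both series advance
            )
    return cost[n][m]


# Two series with the same shape but different timing have distance zero.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because the alignment may stretch or compress the time axis, series that follow the same course at a different pace are treated as similar, which a pointwise Euclidean distance would not do.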

DBS found 10 clusters in the Tetragonula data set, as verified by the heatmap and silhouette plot. When the largest within-cluster gap, the cluster separation, and the average within-cluster dissimilarity of [Hennig, 2014] were calculated, the resulting values were the minima reported in [Hennig, 2014], presented there in Fig. 4. The 10 identified clusters strongly depended on the locations of the bees (chapter 11, Figure 11.8). Additionally, the application of DBS to this data set illustrated the possibility of using multiple swarms by means of parallel computing, for which the term *deep swarming* (see [Ultsch, 2016b]) is introduced here in analogy to deep learning [Goodfellow et al., 2016]. Here, deep swarming was applied with a DCE-based objective function, but it can also be applied in combination with any arbitrary objective function.

For the hydrology data set, the daily courses were analyzed. After preprocessing, DBS identified five distinct clusters (chapter 12, Figure 11.4), which were verified by the heatmap and silhouette plot. The rules extracted from a CART decision tree were applied to the clustering of this data set and found to result in the misclassification of 0.9% of the points (chapter 12, Figure 12.6). Five different water quality states in terms of nitrate concentration and electrical conductivity were identified based on a semantic characterization of these clusters (chapter 12, Figure 12.7). The extracted rules enable the prediction of future nitrate and electrical conductivity conditions.

For the pain gene data set, focus was placed on the task of clustering the pain genes. The distances between genes were defined based on the inverse document frequency (idf) [Sparck Jones, 1972] and the information available in the Gene Ontology (GO) database. The DBS clustering resulted in eight clusters (Figure 12.9). Five clusters reproduced the previously known functions of the pain genes (Tab 12.2), as described in section 12.2.1. Outliers were found in two clusters, and one cluster yielded new discoveries regarding the functions of pain genes (Tab 12.2, C5). This cluster was characterized by the downregulation of metabolic processes and the upregulation of the creatine metabolic process.

<sup>83</sup> Containing 7747 variables.

*"The experience from many knowledge discovery tasks ([Behnisch/Ultsch, 2009; Kupas et al., 2004; Lötsch/Ultsch, 2013; Mörchen et al., 2005]) is that about 80% of clusters coincide with known processes. Typically about 10% may be attributed to erroneous data, while the remaining 10% may generate entirely new knowledge" [Behnisch/Ultsch, 2015, p. 68].* 

This experience is consistent with the findings obtained in the above examples. Two domain experts found the results presented above to be valid and useful.

# **14 Conclusion**

A new and data-driven approach for cluster analysis and visualization is introduced in this work. Projection-based clustering combines structures preserved in two dimensions with underlying high-dimensional structures (see also [Thrun et al., 2017; Thrun/Ultsch, 2017a]). It is a flexible and robust approach for cluster analysis that consists of three independent modules, which can be optionally combined into the Databionic swarm (DBS). Here, the attention is focused on data for which the generation process is complete and for which the size and amount of information can be managed using a personal computer with standard hardware; consequently, the realm of Big Data is not discussed here. To the author's knowledge, DBS is the first swarm-based technique showing emergent properties while simultaneously exploiting the concepts of swarm intelligence, self-organization and the Nash equilibrium concept from game theory, which results in the elimination of a global objective function and of the setting of parameters.

Alternatively, the visualization by the generalized U-matrix and the DBS clustering can be applied to every projection method for connected or compact structures based on discontinuities of high-dimensional data [Thrun/Ultsch, 2017a]. Through the use of the generalized U-matrix visualization, results of common clustering methods can be verified by the structures found by the data-driven Pswarm or any other projection method.

This work introduced the fundamental principle of considering compact versus connected structures in the clustering of data. However, in this context, only unsupervised indices, called QMs for projection methods, were analyzed. A similar analysis of supervised indices should be conducted in the future with the help of the FCPS. There is sufficient literature available to do so (e.g., [Charrad et al., 2012; Dimitriadou et al., 2002; Handl et al., 2005]).

Another goal of future research should be to find a strong Nash equilibrium. However, a strong Nash equilibrium is mathematically difficult to prove. In the opinion of the author, if each DataBot were able to assess all possible jump positions in a given neighborhood instead of only four, then a strong Nash equilibrium could be achieved. However, the time complexity of this approach is too high for practical testing unless the algorithm is parallelized. Additionally, deep swarming should be extensively tested.

Symmetry considerations were applied to the two-dimensional toroidal output space, resulting in the use of polar coordinates in the DBS framework. Additionally, it should be possible to explore and exploit connections with solid-state physics. Perhaps it would be beneficial to define the Bravais lattice, apply a Fourier transformation to the reciprocal lattice [Hunklinger, 2009, pp. 83-88], and perform calculations in the reciprocal space, where boundary effects could be easily eliminated and a low computational time complexity could be achieved. Further research on these possibilities is required.

# **References**


# **Appendices**

The following sections are additions to the various chapters. Supplement A evaluates various QMs on the examples of the Hepta and Chainlink data sets. Supplement B illustrates a high-dimensional example of a bimodal distribution of distances, as explained in chapter 3 (see Fig. 3.1). Supplements C to D show all visualizations of ESOM, SOP and Pswarm for the various data sets introduced in chapter 9. Most importantly, it is illustrated that Pswarm does not find any structure if no such structure exists in a data set (supplement D). Supplement G shows additional 3D prints of Pswarm visualizations. Supplements F, H and I complement the results of this work with further (mostly statistical) comparisons and tests.

# **Supplement A: Evaluation of Common QMs**

The following section illustrates the pitfalls of quality measures using two different examples: Hepta and Chainlink. They demonstrate that no quality measure (QM) is generalizable because every QM makes assumptions about the underlying structure of the data set. If this were not the case, minimizing a QM would lead to the best possible projection of every data set. Both data sets are defined by discontinuities: Hepta is a data set with compact structures, whereas Chainlink is a data set with connected structures.

#### **First Example: Hepta**

As a first example, three projection methods are chosen for the Hepta data set: PCA, CCA and t-SNE. Overall, four projections are evaluated, with the two projections of t-SNE denoted by *t-SNE (1)* and *t-SNE (2)*. The results are depicted in chapter 5, Figure 5.2, where the colors of the points refer to the seven class labels.

PCA has the highest structure preservation. With the default parameters, CCA adds gaps of around 3 points. In the t-SNE (1) projection, which uses the default parameter setting, the density of the data is overestimated, and wide gaps are also added between two points and their cluster. After one parameter of t-SNE is changed, the t-SNE (2) projection is not able to preserve the structures of the data, because many gaps are randomly added.

In Figure A.1, the curves of Trustworthiness and Continuity (T&D) are drawn for the four projections of the Hepta data set. The best structure preservation was achieved by PCA (see supplementary); however, the curves tend to prefer CCA over PCA. If only the first 25 nearest neighbors were plotted, t-SNE (1) would achieve the best results. Of the four cases, T&D is ultimately able to distinguish the worst case, namely, the low structure preservation of t-SNE (2).
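The two curves discussed above are computed from neighborhood ranks. The following minimal sketch uses the commonly cited rank-based definitions of trustworthiness and continuity for a single neighborhood size `k`. It assumes Euclidean distances in both spaces, breaks ties by point index, and requires `k` to be smaller than half the number of points. It is not the implementation used to generate Figure A.1, and the toy data are invented.

```python
import math


def _ranks(points):
    """ranks[i][j] = rank of point j among the neighbors of point i
    (1 = nearest); ties are broken by point index."""
    n = len(points)
    ranks = []
    for i in range(n):
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        row = [0] * n
        for rank, j in enumerate(order, start=1):
            row[j] = rank
        ranks.append(row)
    return ranks


def _penalty(rank_ref, rank_other, k):
    """Sum of (reference rank - k) over all pairs that are k-neighbors in the
    other space but not in the reference space."""
    n = len(rank_ref)
    total = 0
    for i in range(n):
        for j in range(n):
            if i != j and rank_other[i][j] <= k and rank_ref[i][j] > k:
                total += rank_ref[i][j] - k
    return total


def trustworthiness_continuity(high, low, k):
    """Return (trustworthiness, continuity) of the projection `low` of `high`."""
    n = len(high)
    if not 0 < k < n / 2:
        raise ValueError("k must satisfy 0 < k < n / 2")
    r_high, r_low = _ranks(high), _ranks(low)
    norm = 2.0 / (n * k * (2 * n - 3 * k - 1))
    # Trustworthiness penalizes points that intrude into a projected neighborhood.
    trust = 1.0 - norm * _penalty(r_high, r_low, k)
    # Continuity penalizes original neighbors that are pushed away.
    cont = 1.0 - norm * _penalty(r_low, r_high, k)
    return trust, cont


# Toy example: points on a line, projected faithfully and in scrambled order.
high = [(i, 0, 0) for i in range(10)]
faithful = [(i, 0) for i in range(10)]
scrambled = [((3 * i) % 10, 0) for i in range(10)]
print(trustworthiness_continuity(high, faithful, k=2))
print(trustworthiness_continuity(high, scrambled, k=2))
```

Evaluating the function for a range of `k` values yields curves of the kind shown in the figure; the ranking of projections can change with `k`, which is one reason why such curves are difficult to interpret.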

In Table A.1 Topological Index (Spearman's error) and Cpath fail to distinguish the four cases. Topological Correlation (TC) is able to distinguish t-SNE (2) from the other three cases. Cwiring is able to distinguish the four cases, but the difference in values between CCA and PCA is very small. Additionally, without a normalization scheme different data sets would be incomparable. The Classification error with knn=5 is able to rank the PCA projections as the best one and t-SNE (2) as the worst, but prefers t-SNE (1) over the CCA projection.
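The classification error used above can be sketched as a leave-one-out error of a k-nearest-neighbor classifier in the projected space. The leave-one-out protocol, the majority vote and the tie handling are assumptions of this sketch, since the exact evaluation protocol is not restated here; the default `k=5` follows the setting named in the text.

```python
import math
from collections import Counter


def knn_classification_error(projected, labels, k=5):
    """Leave-one-out error of a majority vote among the k nearest
    projected neighbors of each point."""
    n = len(projected)
    wrong = 0
    for i in range(n):
        neighbors = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(projected[i], projected[j]),
        )[:k]
        votes = Counter(labels[j] for j in neighbors)
        predicted = votes.most_common(1)[0][0]
        wrong += predicted != labels[i]
    return wrong / n


# Toy example: two separated groups versus strictly alternating labels.
separated = [(0.1 * i, 0.0) for i in range(6)] + [(10 + 0.1 * i, 0.0) for i in range(6)]
print(knn_classification_error(separated, [0] * 6 + [1] * 6))
alternating = [(float(i), 0.0) for i in range(12)]
print(knn_classification_error(alternating, [i % 2 for i in range(12)]))
```

The measure only asks whether the nearest projected neighbors share a label, so a projection can reach a low error even when it adds gaps or distorts densities within the clusters.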

Calculating the AUC in accordance with [Lee et al., 2014] does not yield proper results either, because CCA is rated as the best projection by far and the other three are rated very similarly. The RAAR curves (Figure A.2) do not lead to correct interpretations. Zrehen's measure evaluates t-SNE (1) as a better projection than PCA or CCA and is only able to identify t-SNE (2) as the worst one.

The precision and recall measures validate that t-SNE minimizes the recall. The measures clearly separate CCA and PCA projections from t-SNE's but cannot distinguish between PCA and CCA projections (see Figure A.3).

On the other hand, the four Shepard diagrams make it possible to clearly distinguish all four cases. Accordingly, the scatter plot of PCA is distinctly correlated, CCA has some errors in the right corner, t-SNE (1) has problems with density, and in t-SNE (2) the distances are randomly distributed. The results of the Shepard diagrams seem to be captured quite well by Kendall's $\tau$ (Table A.1).
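A Shepard diagram plots the pairwise distances in the output space against those in the input space, and a rank correlation summarizes the scatter in one number. The following minimal sketch computes Kendall's $\tau$ in its simplest form ($\tau_a$, without tie correction) between the two distance vectors. The variant used for Table A.1 is not specified here, so the choice of $\tau_a$ and the use of Euclidean distances are assumptions of this sketch.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distances of all unordered point pairs, in a fixed order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.
    Tied pairs count as neither concordant nor discordant."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy example: a projection that only rescales the data preserves the
# order of all pairwise distances.
high = [(0, 0, 0), (1, 0, 0), (3, 0, 0), (7, 0, 0)]
low = [(0, 0), (2, 0), (6, 0), (14, 0)]
print(kendall_tau_a(pairwise_distances(high), pairwise_distances(low)))
```

Because the coefficient is computed over all pairwise distances, it rewards the preservation of the global distance order, which fits compact structures such as those of Hepta but not necessarily connected structures.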

Figure A.1: Trustworthiness and Continuity [Kaski et al., 2003] of the four projections for the first 50 k nearest neighbors. t-SNE (1) instead of PCA has the best values for the first 30 knn, but the t-SNE (1) projection does not represent the density of the data set and adds some gaps (see supplementary). From 30 to 50 knn, it is unclear whether one should prefer the CCA or the PCA projection, but CCA disrupts one cluster (see supplementary) by adding additional gaps. The worst projection, t-SNE (2), can be clearly distinguished. The curves do not change their ranks for values above 50 knn.



Figure A.3: For the Smoothed Precision and recall of Hepta one could prefer either the CCA or PCA projection. The quality measure shows that t-SNE maximizes the recall. One may also choose the best projection depending on the preference for recall over precision, or vice versa.

#### **Second Example: Chainlink**

In this instance, the projections of PCA and of two different trials of CCA, which yield different results, are evaluated. The projections are shown in Figure A.4. Both CCA projections were computed using the same set of parameters, but the outcome is not deterministic. Instead, the quality of the projection depends on the trial. The PCA projection completely fails to preserve the structures because PCA only rotates the data set and the discontinuities are not linearly separable. The first projection, CCA (1), shows good structure preservation, but the second projection, CCA (2), cuts one cluster in half and projects it into the middle of the second cluster, thus disrupting discontinuities of the input space by letting intruding points in between. This example illustrates that high structure preservation sometimes requires accepting higher BPE/FPE errors. A smaller BPE/FPE in CCA (2) does not lead to higher structure preservation, because the CCA (2) projection results in additional gaps (Figure A.4).

The evaluation of QMs is restricted to the Shepard density plot with Kendall's $\tau$, the Cwiring measure, precision and recall (Figure A.5), and Trustworthiness and Discontinuity (T&D in Figure A.6), which were the best approaches in the first example. For the CCA and PCA projections of Hepta, the results of precision and recall, as well as of the classification error, were ambiguous. Thus, they are added for the projections of the Chainlink data set. One could argue that T&D alone cannot distinguish gaps of lower relevance (some points are in the wrong neighborhood) and data density. Hence, results are shown in Figure A.6 for the Chainlink data set. The Shepard density plot and Kendall's $\tau$ are not able to measure structure preservation. This is because the structures of the data set are not compact; each ring is closer to some points of the other class than to points of its own class. Cwiring also fails completely.

The difference in the T&D measure is very small (<3%). Discontinuity ranks PCA as the best projection, whereas Trustworthiness ranks CCA (2) highest for the first 50 knn and CCA (1) thereafter. For the PCA projection, recall is clearly much better than for both CCA projections. For the CCA (1) projection, precision is slightly better than for the CCA (2) projection. However, the best projection may be chosen according to the preference for recall over precision or vice versa.

The classification error is exactly zero for both CCA projections; hence, they cannot be distinguished. The PCA projection has an error of 0.3%, only slightly above zero, although its structure preservation is very low.

Table A.2: Cwiring results for three projections of the Chainlink data set, sorted from the worst to the best structure preservation. The CCA projection is ranked worse than the PCA projection. However, one CCA projection preserves structures significantly better than the PCA projection. Kendall's $\tau$ ranks the PCA projection as the best.


Figure A.4: Chainlink projections by the PCA and CCA methods. The PCA projection overlaps the clusters, whereas CCA shows three clearly separated clusters in the first trial (CCA wrong) and preserves the cluster structure in the second trial (CCA correct).

Figure A.5: Smoothed Precision and Recall of Chainlink. It is unclear which projection is structure preserving, but the projections of CCA can be distinguished from each other.

Figure A.6: T&D for the Chainlink data set. For Discontinuity, PCA is clearly regarded as the best projection, while the CCA (2) projection is rated best for Trustworthiness up to the first 50 knn and the CCA (1) projection thereafter. Compared to Figure A.2 of the supplementary, the CCA (1) projection is clearly the best one. Note that the difference between the three projections is only around 3 percent, but the visual differences in Figure A.2 are clear.

#### **Supplement B: Wine Dataset Distance Distribution**

Only Euclidean distances (Figure B.7) were used for SOP, consistent with the settings defined by [Herrmann, 2011, p. 98] and the restrictions of the source code. For Pswarm, the squared Euclidean distances were used because they are slightly more bimodal (Figure B.8), indicating a better distinction between inter- and intracluster distances; for further details, see chapter 3, Figure 3.1. The distance distributions were generated using the AdaptGauss CRAN package [Thrun/Ultsch, 2015; Ultsch et al., 2015].

Figure B.7: Distribution of Euclidean distances visualized by histogram, PDEplot, QQplot, Boxplot and the amount of NaNs: The distribution is in the first approximation unimodal.

Figure B.8: Distribution of squared Euclidean distances visualized by histogram, PDEplot, QQplot, Boxplot and the amount of NaNs: The distribution is in the first approximation bimodal distinguishing intra- and inter-cluster distances.

#### **Supplement C: Generalized Umatrix of Pswarm and SOP**

Supplement C compares the visualizations of DBS through the projection method of Pswarm with the Generalized U-Matrix of SOP for all data sets introduced in chapter 9 which were not shown in this work up until now.

Figure C.9: Topographic map of the Swiss Banknotes data set projected using SOP with the default parameters: The hills of the generalized U-matrix indicate 3 clusters, and one green point is misplaced in the small cluster.

Figure C.10: Topographic map of the Swiss Banknotes data set projected using DBS (36x40) with an automatically chosen lattice size: Two clusters are clearly visible, with two misplaced points. The clustering accuracy of the DBS projection is 99%.

Figure C.11: Topographic map of the Wine data set projected using SOP with the default parameters: The cluster structure is intertwined. Without the colored labels, the clusters could not be identified.

Figure C.12: Topographic map of the Wine data set projected using DBS (28x32) with an automatically chosen lattice size and squared Euclidean distances: The first cluster (green, right) is rectangular in form, the second cluster (blue, left) is square, and the third (pink, bottom) is triangular. The DBS projection yields a clustering accuracy of 92%.

Figure C.13: Topographic map of the Iris data set projected using SOP with the default parameters: One cluster (green) is clearly visible, but the other two clusters (pink and blue) are not correctly reproduced because too many points (11%) are misplaced. The radius of the P-matrix was automatically chosen to be 1.38.

Figure C.14: Topographic map of the Iris data set projected using DBS (26x28) with an automatically chosen lattice size: Three clusters are clearly visible, but with five misplaced points. The points in the first cluster (green) are clearly separated, and the second cluster (blue) has a much higher density than the third cluster (pink). The clustering accuracy of the DBS projection is 99%.

Figure C.15: Topographic map of the Atom data set projected using SOP with the default parameters: The projection shows hills separating parts of the green-labeled cluster. Without the labels corresponding to the prior classification, three clusters would be seen.

Figure C.16: Topographic map of the Atom data set projected using DBS (58x60) with an automatically chosen lattice size: Two clusters are visible, without any substructures. The clustering accuracy of the DBS projection is 100%.

Figure C.17: Topographic map of the Chainlink data set projected using SOP with the default parameters: Two clusters are visible, with two points that could be misinterpreted as outlier points (the green point is shown twice here). The projection is not smooth, as seen from the hilly substructures evident in the clusters.

Figure C.18: Topographic map of the Chainlink data set projected using DBS (64x64) with an automatically chosen lattice size: Two clusters are clearly visible, but there is one point that could be misinterpreted as an outlier point (shown twice here). The projection is smoother than that of SOP, as seen from the fact that no hills are visible within the clusters. The clustering accuracy of the DBS projection is 100%.

#### **Supplement D: DBS Visualizations of S-shape and uniform Cuboid**

Figure D.19 verifies that DBS does not visualize any structures if the data set does not contain any.

Figure D.19: Topographic maps generated by DBS for three data sets that do not contain any natural cluster structure. The visualizations show no cluster structure. Top: cuboid with uniformly distributed points; middle: cuboid with Gaussian-distributed points; bottom: S-shape data set (see chapter 9 for data set descriptions).

#### **Supplement E: U-Matrix Visualizations of ESOM Projections**

All source code was executed in R 3.2.3 [R project, 2008] on a 64-bit Windows 7 system. The ESOM was parameterized with a 50x82 toroidal lattice and a Gaussian neighborhood function. The further parameters of the annealing scheme were 20 epochs, a global neighborhood (learning) radius from Rmax=24 to Rmin=1, and a learning rate that started at 0.5 and ended at 0.1. The visualizations in Figs. E.20, E.21, E.22, and E.23 are compared with the DBS visualizations in chapter 10.3.
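The annealing parameters above can be illustrated with a minimal sketch. The text gives only the start and end values of the radius and the learning rate, so the linear decay used here is an assumption, and the function name `linear_schedule` is hypothetical; the actual ESOM implementation may decay these values differently.

```python
def linear_schedule(start, end, epochs):
    """Return one value per epoch, moving linearly from start to end.

    Linear decay is an assumption made for illustration only.
    """
    if epochs < 2:
        return [float(start)]
    step = (end - start) / (epochs - 1)
    return [start + step * epoch for epoch in range(epochs)]


EPOCHS = 20  # number of epochs stated in the text

# Neighborhood (learning) radius from Rmax=24 down to Rmin=1
radii = linear_schedule(24, 1, EPOCHS)

# Learning rate from 0.5 down to 0.1
rates = linear_schedule(0.5, 0.1, EPOCHS)

for epoch, (radius, rate) in enumerate(zip(radii, rates), start=1):
    print(f"epoch {epoch:2d}: radius = {radius:5.2f}, learning rate = {rate:.3f}")
```

Under this assumption, the radius and the learning rate shrink by the same fraction of their range in every epoch, so coarse ordering happens in the early epochs and fine adjustment in the late ones.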

Figure E.20: ESOM projection and U-matrix visualization of the Wine data set. The clusters are difficult to separate without the colored labels. Many points are misplaced.

Figure E.21: ESOM projection and U-matrix visualization of the Swiss Banknotes data set. One best matching unit is misplaced. The cluster with blue best matching units could be interpreted as a small and a big cluster because of the high hills in between.

Figure E.22: ESOM projection and U\*-matrix visualization of the Iris data set. With the default parameters, the clusters with blue and pink best matching units cannot be separated.

Figure E.23: ESOM projection and U-matrix visualization of the Wing Nut data set. If the default parameterization of ESOM is not changed from toroidal to planar, the structures of the clusters are very difficult to see.

#### **Supplement F: Statistical Tests in Hydrology**

Tables F.3 and F.4 compare the clustering achieved in chapter 12.1 for conductivity and for nitrate. The clusters should contain samples of different natures that are based on different processes. Given this assumption, it is valid to statistically test whether the N&C distributions significantly differ between clusters. The Kolmogorov–Smirnov test (KS test) is a nonparametric two-sample test of the null hypothesis that two variables are drawn from the same continuous distribution [Conover, 1971, pp. 309-314]. All N&C distributions significantly differ between clusters, with the exception of cluster 4 compared with cluster 5.

Table F.3: KS test with test statistic *D* and p-value *p* for conductivity. The null hypothesis could not be rejected for clusters 4 and 5.

Cluster No.: C1 (223), C2 (87), C3 (21), C4 (7), C5 (5)


Table F.4: KS test with test statistic *D* and p-value *p* for nitrate. The null hypothesis could not be rejected for clusters 4 and 5.
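The test statistic *D* of the two-sample KS test is the largest absolute difference between the two empirical distribution functions. The following minimal sketch computes *D* only and does not derive the p-value. The function name `ks_statistic` is hypothetical, and the sample values are made up for illustration; they are not the hydrology measurements.

```python
from bisect import bisect_right


def ks_statistic(x, y):
    """Two-sample KS statistic D = max |F_x(t) - F_y(t)|.

    The empirical distribution functions are step functions, so the
    maximum is attained at one of the observed values.
    """
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for t in xs + ys:
        f_x = bisect_right(xs, t) / len(xs)  # share of x values <= t
        f_y = bisect_right(ys, t) / len(ys)  # share of y values <= t
        d = max(d, abs(f_x - f_y))
    return d


# Made-up toy samples standing in for one variable in two clusters
cluster_a = [1, 2, 3, 4]
cluster_b = [3, 4, 5, 6]
print(ks_statistic(cluster_a, cluster_b))
```

A value of *D* close to 0 means that the two empirical distributions nearly coincide, while a value of 1 means that the samples do not overlap at all.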


#### **Supplement G: 3D Prints of Generalized Umatrix Visualizations of DBS**

Figs. G.24 and G.25 show the 3D prints of the visualizations of chapter 12 [Thrun et al., 2016a].

Figure G.24: 3D print of the DBS topographic map of the Hydrology data set of chapter 12, Figure 12.4 (cf. [Thrun et al., 2016a]). Colors are not yet available due to technical limitations.

Figure G.25: 3D print of the DBS topographic map of the pain genes of chapter 12, Figure 12.9 (cf. [Thrun et al., 2016a]). Colors are not yet available due to technical limitations.

#### **Supplement H: Contingency Table for Tetragonula Bees Clustering**

Chapter 11.3 introduces the Databionic swarm clustering of the Tetragonula Bees data set and evaluates it with the unsupervised indices of the heatmap and the Silhouette plot. In addition, Tab. H.5 evaluates the clustering by comparing it with the clustering of [Hennig 2014] in a contingency table. Except for cluster 6, both clusterings are similar to each other.
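A contingency table of two clusterings counts, for every pair of cluster labels, how many data points receive that pair. The following minimal sketch shows the construction. The function name `contingency_table` is hypothetical, and the label vectors are made up for illustration; they are not the Tetragonula Bees clusterings.

```python
from collections import Counter


def contingency_table(labels_a, labels_b):
    """Cross-tabulate two clusterings of the same data points.

    Returns the row labels, the column labels, and a matrix in which
    entry [i][j] counts the points with row label i and column label j.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same points")
    counts = Counter(zip(labels_a, labels_b))
    rows = sorted(set(labels_a))
    cols = sorted(set(labels_b))
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


# Made-up cluster labels for five data points
first_clustering = [1, 1, 2, 2, 3]
second_clustering = [1, 1, 2, 3, 3]

rows, cols, table = contingency_table(first_clustering, second_clustering)
for r, counts_in_row in zip(rows, table):
    print(r, counts_in_row)
```

Two clusterings agree when each row has one dominant entry. A row whose counts are spread over several columns, such as the second row in this toy example, marks a cluster on which the two clusterings disagree.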



#### **Supplement I: Statistical Tests for FCPS clustering compared to DBS**

Tab. I.6 presents the p-values of the Bonferroni-adjusted Wilcoxon rank sum test for the results in chapter 10, Figure 10.1. If the p-value is lower than 0.05, then DBS significantly outperforms the other clustering method.

Table I.6: Wilcoxon rank sum test for Fig. 10.1 in chapter 10. Abbreviations: single linkage (SL), Linde-Buzo-Gray algorithm (LBG-kMeans), partitioning around medoids (PAM), mixtures-of-Gaussians clustering (MoG), also known as model-based clustering.
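The Bonferroni adjustment multiplies each raw p-value by the number of tests performed and caps the result at 1. The following minimal sketch applies it and then the 0.05 threshold named above. The function name `bonferroni` is hypothetical, and the raw p-values are made up for illustration; they are not the values of Table I.6.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of
    tests and cap the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


ALPHA = 0.05  # significance threshold used in the text

# Made-up raw p-values of three pairwise comparisons
raw = [0.01, 0.02, 0.5]
adjusted = bonferroni(raw)
significant = [p < ALPHA for p in adjusted]

for p_raw, p_adj, sig in zip(raw, adjusted, significant):
    print(f"raw = {p_raw:.2f}, adjusted = {p_adj:.2f}, significant = {sig}")
```

In this toy example, the second comparison would pass the threshold without the adjustment but no longer does after it, which shows how the adjustment guards against false positives when several methods are compared at once.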






